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Bayesian Nonparametric Inference of Switching 
Linear Dynamical Systems 

Emily Fox, Erik Sudderth, Michael Jordan, and Alan Willsky 

Abstract 

Many complex dynamical phenomena can be effectively modeled by a system that switches among a set of 
conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) 
and the switching vector autoregressive (VAR) process. Our Bayesian nonparametric approach utilizes a hierarchical 
Dirichlet process prior to learn an unknown number of persistent, smooth dynamical modes. We additionally employ 
automatic relevance determination to infer a sparse set of dynamic dependencies allowing us to learn SLDS with 
varying state dimension or switching VAR processes with varying autoregressive order. We develop a sampling 
algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the 
mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of 
dancing honey bees, the IBOVESPA stock index, and a maneuvering target tracking application. 
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I. Introduction 

LINEAR dynamical systems (LDSs) are useful in describing dynamical phenomena as diverse as human motion 
121, H, financial time-series O-Q, maneuvering targets J8], 0, and the dance of honey bees ifTOl . However, 
, such phenomena often exhibit structural changes over time, and the LDS models which describe them must also 

■ change. For example, a ballistic missile makes an evasive maneuver; a country experiences a recession, a central 
7— ( . bank intervention, or some national or global event; a honey bee changes from a waggle to a turn right dance. 

\ Some of these changes will appear frequently, while others are only rarely observed. In addition, there is always 

■ the possibility of a new, previously unseen dynamical behavior. These considerations motivate us to develop a 
00 ■ Bayesian nonparametric approach for learning switching LDS (SLDS) models. We also consider a special case 
CO \ of the SLDS — the switching vector autoregressive (VAR) model — in which direct observations of the underlying 

1 dynamical process are assumed available. 

One can view the SLDS, and the simpler switching VAR process, as an extension of hidden Markov models 
(HMMs) in which each HMM state, or mode, is associated with a linear dynamical process. While the HMM 
makes a strong Markovian assumption that observations are conditionally independent given the mode, the SLDS 
and switching VAR processes are able to capture more complex temporal dependencies often present in real data. 
Most existing methods for learning SLDS and switching VAR processes rely on either fixing the number of HMM 

■ modes, such as in IfTOl . or considering a change-point detection formulation where each inferred change is to a new, 
previously unseen dynamical mode, such as in ifTTTl . In this paper we show how one can remain agnostic about the 
number of dynamical modes while still allowing for returns to previously exhibited dynamical behaviors. 

Hierarchical Dirichlet processes (HDP) can be used as a prior on the parameters of HMMs with unknown mode 
space cardinality HH, lfl3l . In this paper we use a variant of the HDP-HMM — the sticky HDP-HMM of |14] — that 
provides improved control over the number of modes inferred; such control is crucial for the problems we examine. 
Our Bayesian nonparametric approach for learning switching dynamical processes extends the sticky HDP-HMM 
formulation to learn an unknown number of persistent dynamical modes and thereby capture a wider range of 
temporal dependencies. We then explore a method for learning which components of the underlying state vector 
contribute to the dynamics of each mode by employing automatic relevance determination (ARD) lfT31 - |[T7l . The 
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resulting model allows for learning realizations of SLDS that switch between an unknown number of dynamical 
modes with possibly varying state dimensions, or switching VAR processes with varying autoregressive orders. 

A. Previous System Identification Techniques 

Paoletti et. al. fT8] provide a survey of recent approaches to identification of switching dynamical models. The 
most general formulation of the problem involves learning: (i) the number of dynamical modes, (ii) the model 
order, and (iii) the associated dynamic parameters. For noiseless switching VAR processes, Vidal et. al. |[T9l present 
an exact algebraic approach, though relying on fixing a maximal mode space cardinality and autoregressive order. 
Psaradakis and Spagnolog [20] alternatively consider a penalized likelihood approach to identification of stochastic 
switching VAR processes. 

For SLDS, identification is significantly more challenging, and methods typically rely on simplifying assumptions 
such as deterministic dynamics or knowledge of the mode space. Huang et. al. |j2TI present an approach that assumes 
deterministic dynamics and embeds the input/output data in a higher-dimensional space and finds the switching times 
by segmenting the data into distinct subspaces Il22l . Kotsalis et. al. |23l develop a balanced truncation algorithm 
for SLDS assuming the mode switches are i.i.d. within a fixed, finite set; the authors also present a method for 
model-order reduction of HMMfl In ||25l , a realization theory is presented for generalized jump-Markov linear 
systems (GJMLS) in which the dynamic matrix depends both on the previous mode and current mode. Finally, when 
the number of dynamical modes is assumed known, Ghahramani and Hinton [26] present a variational approach 
to segmenting the data into the linear dynamical regimes and learning the associated dynamic parameters^. For 
questions of observability and identifiability of SLDS in the absence of noise, see ll27l . 

In the Bayesian approach that we adopt, we coherently incorporate noisy dynamics and uncertainty in the mode 
space cardinality. Our choice of prior penalizes more complicated models, both in terms of the number of modes and 
the state dimension describing each mode, allowing us to distinguish between the set of equivalent models described 
in (271. Thus, instead of placing hard constraints on the model, we simply increase the posterior probability of 
simpler explanations of the data. As opposed to a penalized likelihood approach using Akaike 's information criterion 
(AIC) |[28l or the Bayesian information criterion (BIC) (29l , our approach provides a model complexity penalty in 
a purely Bayesian manner. 

In Sec. |II1 we provide background on the switching linear dynamical systems we consider herein, and previous 
Bayesian nonparametric methods of learning HMMs. Our Bayesian nonparametric switching linear dynamical 
systems are described in Sec. JII] We proceed by analyzing a conjugate prior on the dynamic parameters, and 
a sparsity-inducing prior that allows for variable-order switching processes. The section concludes by outlining 
a Gibbs sampler for the proposed models. In Sec. [TV] we present results on synthetic and real datasets, and in 
Sec. |V] we analyze a set of alternative formulations that are commonly found in the maneuvering target tracking 
and econometrics literature. 



A. Switching Linear Dynamic Systems 

A state space (SS) model consists of an underlying state, x t € M. n , with dynamics observed via y t G M. d . A 
linear time-invariant (LTI) SS model is given by 



II. Background 



x t = Ax t -i + e t y t = Cx t + w t 



(1) 



where e t and w t are independent Gaussian noise processes with covariances £ and R, respectively. 
An order r VAR process, denoted by VAR(r), with observations y t E can be defined as 



r 





(2) 



i=l 



'The problem of identification of HMMs is thoroughly analyzed in 1241 . 

2 This formulation uses a mixture of experts SLDS in which M different continuous-valued state sequences evolve independently with 
linear dynamics and the Markovian dynamical mode selects which state sequence is observed at a given time. 
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Every VAR(r) process can be described in SS form, though not every SS model may be expressed as a VAR(r) 
process for finite r 11301 . 

The dynamical phenomena we examine in this paper exhibit behaviors better modeled as switches between a set 
of linear dynamical models. We define a switching linear dynamical system (SLDS) by 

z t | zt-i ~ vr 2t _ 1 
x t = A^xt-i + e t (z t ) y t = Cx t + w t . 

The first-order Markov process z t with transition distributions {ttj} indexes the mode-specific LDS at time t, which 
is driven by Gaussian noise e t (z t ) ~ A/"(0, E^*)). One can view the SLDS as an extension of the classical hidden 
Markov model (HMM) (3TJ, which has the same mode evolution, but conditionally independent observations: 

Zt I Zt-l ~ ^Zt-i 

y t \z t ~F{9 Zt ) 

for an indexed family of distributions F(-) where 0% are the emission parameters for mode i. 
We similarly define a switching VAR(r) process by 

Z t | Zt-l ~ TT Zt _ 1 

i=i 



B. Dirichlet Processes and the Sticky HDP-HMM 

To examine a Bayesian nonparametric SLDS, and thus relax the assumption that the number of dynamical modes 
is known and fixed, it is useful to first analyze such methods for the simpler HMM. One can equivalently represent 
the finite HMM of Eq. (0]) via a set of transition probability measures Gj = Y^k=x 7r ifc^e fc > where 5g is a mass 
concentrated at 0. We then operate directly in the parameter space © and transition between emission parameters 
with probabilities given by {Gj}. That is, 

Of I 0+ 1 ~ Ga-ri —f) . 

t I t-i j.o t l -v, (6) 

vt I % ~ Fie't)- 

Here, Ql € {0\, ... ,0k} and is equivalent to Zt of Eq. (0]). A Bayesian nonparametric HMM takes Gj to be 
random^ with an infinite collection of atoms corresponding to the infinite HMM mode space. 

The Dirichlet process (DP), denoted by DP(7, H), provides a distribution over discrete probability measures 
with an infinite collection of atoms 

oo 

G = J2^e k Ok ~H, (7) 

k=l 

on a parameter space ©. The weights are sampled via a stick-breaking construction I32]: 

fc-i 

Pk = VkJlO- ~ "i) f fe ~Beta(l, 7 ). (8) 

t=x 

In effect, we have divided a unit-length stick into lengths given by the weig hts p k : the k th weight is a random 
proportion v k of the remaining stick after the previous (k — X) weights have been defined. We denote this distribution 

by ~ GEM (7). 

The Dirichlet process has proven useful in many applications due to its clustering properties, which are clearly 
seen by examining the predictive distribution of draws 0\ ~ Gq. Because probability measures drawn from a 
Dirichlet process are discrete, there is a strictly positive probability of multiple observations 0\ taking identical 
values within the set {Ok}, with Op. defined as in Eq. (O. For each value 0' { , let zi be an indicator random variable 

'Formally, a random measure on a measurable space O with sigma algebra A is defined as a stochastic process whose index set is A. 
That is, G(A) is a random variable for each A 6 A. 
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Fig. 1. Sticky HDP-HMM prior on (a) switching VAR(2) and (b) SLDS processes with the mode evolving as Zt+i\{TTk}kLii z t ~ 
7r Zt for TTk\a, K)/3 ~ DP(a + k, (a/3 + nSk) /(ot+ k)). Here, /3 | 7 ~ GEM(7) and 6*fe | iJ, A ~ H(X). The dynamical processes 
are as in Table U 



that picks out the unique value 9k such that 6[ = 6 Zi . Blackwell and MacQueen 11331 introduced a Polya urn 
representation of the 9\: 

0[ I 9[, BU ~ — + ^ ~^—M = —^—tH + —^6e h . (9) 

7 + z — 1 7 + z — 1 3 7 + z — 1 z — ' 7 + 1 — 1 

' i=i fe=l 

Here, is the number of observations Q' { taking the value 9k- From Eq. (©, and the discrete nature of Go, we see a 
reinforcement property of the Dirichlet process that induces sparsity in the number of inferred mixture components. 

A hierarchical extension of the Dirichlet process, the hierarchical Dirichlet process (HDP) lfT2l . has proven useful 
in defining a prior on the set of HMM transition probability measures Gj. The HDP defines a collection of probability 
measures {Gj} on the same support points {9\, 62, ■ ■ ■ } by assuming that each discrete measure Gj is a variation 
on a global discrete measure Go- Specifically, the Bayesian hierarchical specification takes Gj ~ DP(a,Go), with 
Go itself a draw from a Dirichlet process DP (7, H). Through this construction, one can show that the probability 
measures are described as 

Go = EfcLi Pk6 $k P \ 7 ~ GEM( 7 ) . 

Applying the HDP prior to the HMM, we obtain the HDP-HMM of Teh et. al. [12]. This corresponds to the model 
in Fig. (2a), but without the edges between the observations. 

By defining -Kj ~ DP(a,/3), the HDP prior encourages modes to have similar transition distributions. Namely, 
the mode-specific transition distributions are identical in expectation: 

E[7T jk I 13} = fa- (11) 

However, it does not differentiate self-transitions from moves between modes. When modeling dynamical processes 
with mode persistence, the flexible nature of the HDP-HMM prior allows for mode sequences with unrealistically 
fast dynamics to have large posterior probability. Recently, it has been shown lfl4l that one may mitigate this 
problem by instead considering a sticky HDP-HMM where -Kj is distributed as follows: 

a + K,— — — 3 -\. (12) 

a + k ) 

Here, (a(3 + ndj) indicates that an amount k > is added to the j th component of a(3. This construction increases 
the expected probability of self-transition by an amount proportional to k. Specifically, the expected set of weights 
for transition distribution nj is a convex combination of those defined by j3 and mode-specific weight defined by 

E[n jk I p,a,K] = ^—h + ——6(j,k). (13) 
a + k a + k 

When k = the original HDP-HMM of Teh et. al. |[T2l is recovered. We place a prior on k and learn the 
self-transition bias from the data. 
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HDP-AR-HMM 


HDP-SLDS 


Mode dynamics 
Observation dynamics 


Zt Zt-l ~ 7Tz t ! 

V t = J2U 1 A ( i zt) y t _ l + e t (z t ) 


Zt Zt-l ~ 7T Zt _i 

x t = A (zt) a; t _i + e t (z t ) 
y t = Ca;i + w t 



TABLE I 

Dynamic equations for the HDP-AR-HMM and HDP-SLDS. Here, tvj is as defined in Eq. (O for the sticky HDP-HMM. 
The additive noise processes are distributed as e t (k) ~ A/"(0, E (fc) ) and w t ~ A/"(0, -R). 





HDP-AR-HMM 


HDP-SLDS 


Dynamic matrix 
Pseudo-observations 
Lag pseudo-observations 


A (k) = [ A W __, A W] £K dx(d,r) 

V»* = y t 

$t = lvI-i---y?-r} T 


J±(k) _ A (k) g gnxn 

V't = x t 
tf> t = a5t_i. 



TABLE II 

NOTATIONAL CONVENIENCES USED IN DESCRIBING THE GIBBS SAMPLER FOR THE HDP-AR-HMM AND HDP-SLDS. 

III. The HDP-SLDS and HDP-AR-HMM 

We now consider a significant extension of the sticky HDP-HMM for both SLDS and VAR modeling, capturing 
dynamic structure underlying the observations by allowing switching among unknown number of unknown dynamics 
using Bayesian nonparametric methods to capture these uncertainties (and to allow both learning the number of 
modes and estimating system state). Fig. QJb) illustrates the HDP-SLDS model, while Fig. (Ha) illustrates the 
HDP-AR-HMM model (for the case of VAR(2)). The generative processes for these two models are summarized in 
Table H 

For the HDP-SLDS, we place priors on the dynamic parameters {A^ k \ T,^} and on measurement noise R 
and infer their posterior from the data. However, without loss of generality^, we fix the measurement matrix to 
C = [Id 0] implying that it is the first d components of the state that are measured. Our choice of the state dimension 
n is, in essence, a choice of model order, and an issue we address in Sec. IIII-A2I For the HDP-AR-HMM, we 
similarly place a prior on the dynamic parameters, which in this case consist of {A^\ . . . , A^\ T,^}. 

In Sec. IIII-B I we derive a Gibbs sampling inference scheme for our models. There is, of course, a difference 
between the steps required for SLDS-based model (in which there is an unobserved continuous-valued state x t ) 
and the AR-based model. In particular, for the HDP-SLDS the algorithm iterates among the following steps: 

1) Sample the state sequence x\-t given the mode sequence z\-t and SLDS parameters {A^ k \ Y,( k \ R}. 

2) Sample the mode sequence Z\-t given the state sequence x\-t, HMM parameters {vr^}, and dynamic param- 
eters {y#),£( fc )}. 

3) Sample the HMM parameters {irk} and SLDS parameters {A^ k ' ,H^ k \ R} given the sequences, z\-t, x\.t, 
and y VT . 

For the HDP-AR-HMM, step (1) does not exist. Step (2) then involves sampling the mode sequence z\-t given 
the observations y VT (rather than x\-t), and step (3) involves conditioning solely on the sequences Z\-t and 
y VT (not 2Ji:t)- Also, we note that step (2) involves a fairly straightforward extension of the sampling method 
developed in lfl4l for the simpler HDP-HMM model; the other steps, however, involve new constructs, as they 
involve capturing and dealing with the temporal dynamics of the underlying continuous state models. Sec. IIII-AI 
provides the necessary priors and structure of the posteriors needed to develop these steps. 

A. Priors and Posteriors of Dynamic Parameters 

We begin by developing a prior to regularize the learning of the dynamic parameters (and measurement noise) 
conditioned on a fixed mode assignment z\ : t- To make the connections between the samplers for the HDP-SLDS and 
HDP-AR-HMM explicit, we introduce the concept of pseudo-observations "J^lt an d rewrite the dynamic equation 
for both the HDP-SLDS and HDP-AR-HMM generically as 

1>t = A (fc) ^_x + e t , (14) 

where we utilize the definitions outlined in Table JI] 

4 This is, in essence, an issue of choosing a similarity transformation for the state of a minimal system, exploiting the fact that the 
measurement matrix is shared by all modes of the HDP-SLDS so that the same transformation can be used for all modes. 
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For the HDP-AR-HMM, we have simply written the dynamic equation in Table U in matrix form by concatenating 
the lag matrices into a single matrix A^ and forming a lag observation vector -ip t comprised of a series of previous 
observation vectors. For this section (for the HDP-SLDS), we assume such a sample of the state sequence x\ : t (and 
hence {V> t ,$J) is available so that Eq. CU) applies equally well to both the HDP-SLDS and the HDP-AR-HMM. 
Methods for resampling this state sequence are discussed in Sec. IIII-BI 

Conditioned on the mode sequence, one may partition this dynamic sequence into K different linear regression 
problems, where K = \{z\, . . . ,Zr}|. That is, for each mode k, we may form a matrix with n k columns 
consisting of the %j) t with z t = k. Then, 

\j,(k) _ j±(k)=,(k) + E (fe) 

where 5K fc ) is a matrix of the associated V> t _i, and E( fc ) the associated noise vectors. 

1) Conjugate Prior on {A^ k \ The matrix-normal inverse-Wishart (MNIW) prior f34l is conjugate to the 
likelihood model defined in Eq. (031 ) for the parameter set {A^,Sw}, Although this prior is typically used for 
inferring the parameters of a single linear regression problem, it is equally applicable to our scenario since the linear 
regression problems of Eq. (fl5T ) are independent conditioned on the mode sequence z\-t- We note that although 
the MNIW prior does not enforce stability constraints on each mode, this prior is still a reasonable choice since 
each mode need not have stable dynamics for the SLDS to be stable IT351 . and conditioned on data from a stable 
mode, the posterior distribution will likely be sharply peaked around stable dynamic matrices. 

Let D( fc ) = {^w, i&( k ^}. The posterior distribution of the dynamic parameters for the k th mode decomposes as 

p(AW,EW I B^)=p(A^ | £W,D( fc ))p(£( fc ) | DW). (16) 
The resulting posterior of A^ fe ^ is straightforwardly derived to be (see 11361 ) 

P (AW I sW,DW)=^(A«;S^,sW,sg , (17) 

with B~^ denoting (B^)^ 1 for a given matrix B, M.M {A; M, K,V) denoting a matrix-normal distribution 
with mean matrix M and left and right covariances K and V, and 

S g = *( fc )*( fc ) T +^ S jg = *W*W r +MK SW=*W*W T +MKM T . (18) 
The marginal posterior of T,^ is 

p(E (fc) | D (*)) = iw (n k + n , S« + So) , (19) 

where IW(no, So) denotes an inverse-Wishart prior with no degrees of freedom and scale matrix So, and is updated 
by data terms S« = S« - sgs^S^f and n k = \{t | z t = k, t = 1, . . . , T}\. 

2) Alternative Prior — Automatic Relevance Determination: The MNIW prior leads to full A^ matrices, which 
(i) becomes problematic as the model order grows in the presence of limited data; and (ii) does not provide a method 
for identifying irrelevant model components (i.e. state components in the case of the HDP-SLDS or lag components 
in the case of the HDP-AR-HMM.) To jointly address these issues, we alternatively consider automatic relevance 
determination (ARD) lfT51 ~ lfT71 , which encourages driving components of the model parameters to zero if their 
presence is not supported by the data. 

For the HDP-SLDS, we harness the concepts of ARD by placing independent, zero-mean, spherically symmetric 
Gaussian priors on the columns of the dynamic matrix A^: 

n 

P (A^\a^) = Y[M(a^;0,afh n ). (20) 

3=1 

Each precision parameter a~p is given a Gamma(a, b) prior. The zero-mean Gaussian prior penalizes non-zero 
columns of the dynamic matrix by an amount determined by the precision parameters. Iterative estimation of these 
hyperparameters and the dynamic matrix A^ leads to becoming large for columns whose evidence in the 

(k) tk) 

data is insufficient for overcoming the penalty induced by the prior. Having aj ' oo drives a - 0, implying 
that the j th state component does not contribute to the dynamics of the k th mode. Thus, examining the set of 
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large provides insight into the order of that mode. Looking at the k th dynamical mode alone, having ap = 
implies that the realization of that mode is not minimal since the associated Hankel matrix 

U=[C T CA T ■■■ {CA d ~ l ) T ] T [G AG ■■■ A d - 1 G]=OK (21) 

has reduced rank. However, the overall SLDS realization may still be minimal. 

For our use of the ARD prior, we restrict attention to models satisfying the property that the state components 
that are observed are relevant to all modes of the dynamics: 

(k) 

Criterion 3.1: If for some realization 1Z a mode k has a - = 0, then that realization must have Cj = 0, where Cj 
is the j th column of C. Here we assume, without loss of generality, that the observed states are the first components 
of the state vector. 

This assumption implies that our choice of C = [Id 0] does not interfere with learning a sparse realization^- 
The ARD prior may also be used to learn variable-order switching VAR processes. Here, the goal is to "turn off" 

(k) 

entire lag blocks A\ (whereas in the HDP-SLDS we were interested in eliminating columns of the dynamic matrix.) 
Instead of placing independent Gaussian priors on each column of A^ fc ^ as we did in Eq. (l20l ). we decompose the 
prior over the lag blocks A\ . 



(A( fe )|c*( fc )) = n^(vec(4 fc >);0,a7( fe )l d2 ) . (22) 



Pi 

Since each element of a given lag block A\ is distributed according to the same precision parameter a\ , if that 
parameter becomes large the entire lag block will tend to zero. 

In order to examine the posterior distribution on the dynamic matrix A^ k \ it is useful to consider the Gaussian 
induced by Eq. d20l > and Eq. (1221 on a vectorization of A^ k \ Our ARD prior on A^ is equivalent to a M(0, ) 
prior on vec(A^), where 

4 &) = diag (a[ k) .....af,...,^,-,^)" 1 . (23) 
Here, m = n for the HDP-SLDS with n replicates of each a\ k \ and m = r for the HDP-AR-HMM with d? 

(k) 

replicates of a\ . (Recall that n is the dimension of the HDP-SLDS state vector x t , r the autoregressive order of 
the HDP-AR-HMM, and d the dimension of the observations y t .) To examine the posterior distribution of A^ k \ 
we note that we may rewrite the state equation as, 

i>t+i = [ $t,ih $t,ih ■ ■ ■ $t,t*rh ] vec(A (fc) ) + e t+ i(k) \ft\z t = k 

^§ t vec(A( fc )) + e t+1 (A0, (24) 



where I = n for the HDP-SLDS and £ = d for the HDP-AR-HMM. Using Eq. (|24il . we derive the posterior 
distribution as 



p(vec( 



A W) I D (*) j S « a W) =N- 1 (Y j *f-iS- (fc) Vt, + E *£iE- (fc) *t-i) • (25) 



t\z t =h t\z t - 



See ||36l for a detailed derivation. Here, M 1 represents a Gaussian A/"(/i,S) with information parameters 

t? = £ - V an d A = X -1 . Given A^ k \ and recalling that each precision parameter is gamma distributed, the 

(k) 

posterior of a\ is given by 

p(aP | A( fc )) = Gamma ( a + i^l, b + t^M^I a V ] . (26 ) 



(k) (k) 

The set Se contains the indices for which a\- has prior precision a\ . Note that in this model, regardless of 

(k) 

the number of observations y t , the size of Si (i.e., the number of a- used to inform the posterior distribution) 

5t 



If there does not exist a realization 7Z satisfying Criterion 13. II we may instead consider a more general model where the measurement 
equation is mode-specific and we place a prior on C' fe ' instead of fixing this matrix. However, this model leads to identifiability issues that 
are considerably less pronounced in the above case. 
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remains the same. Thus, the gamma prior is an informative prior and the choice of a and b should depend upon 
the cardinality of For the HDP-SLDS, this cardinality is given by the maximal state dimension n, and for the 
HDP-AR-HMM, by the square of the observation dimensionality d 2 . 

We then place an inverse-Wishart prior IW(rto, Sq) on T,^ and look at the posterior given A^: 

p(E (fc) | D « A «) = IW (n k + n , S« + 5 ) , (27) 

where here, as opposed to in Eq. ( fT9l , we define 

S % = E (*t - A^m)(^ - A<^ W ) T (28) 

t|zt=fe 

3) Measurement Noise Posterior: For the HDP-SLDS, we additionally place an IW(ro,-Ro) prior on the mea- 
surement noise covariance R. The posterior distribution is given by 

p(R | y l:T , sbi-t) = IW(T + r , + i? ), (29) 

where Sr = Y^t=i{Vt ~ Cx t )(y t — Cx t ) T . Here, we assume that R is shared between modes. The extension to 
mode-specific measurement noise is straightforward. 

B. Gibbs Sampler 

For inference in the HDP-AR-HMM, we use a Gibbs sampler that iterates between sampling the mode sequence, 
z±:T, and the set of dynamic and sticky HDP-HMM parameters. The sampler for the HDP-SLDS is identical with the 
additional step of sampling the state sequence, x\-t, and conditioning on this sequence when resampling dynamic 
parameters and the mode sequence. Periodically, we interleave a step that sequentially samples the mode sequence 
z\-t marginalizing over the state sequence x\-t in a similar vein to that of Carter and Kohn 11371 . We describe the 
sampler in terms of the pseudo-observations tp t , as defined by Eq. (fl4l ). in order to clearly specify the sections of 
the sampler shared by both the HDP-AR-HMM and HDP-SLDS. 

1) Sampling Dynamic Parameters {A^ k \ E^}: Conditioned on the mode sequence, z\.,t, and the pseudo- 
observations, ipi-T, we can sample the dynamic parameters = {A",S"} from the posterior densities of 
Sec. IIII- A I For the ARD prior, we then sample given A^ k \ In practice we iterate multiple times between 
sampling given A^ and A^ given ot^ before moving to the next sampling stage. 

2) Sampling Measurement Noise R (HDP-SLDS only): For the HDP-SLDS, we additionally sample the mea- 
surement noise covariance R conditioned on the sampled state sequence x±._t- 

3) Block Sampling Z\-t: As shown in |[T4l . the mixing rate of the Gibbs sampler for the HDP-HMM can be 
dramatically improved by using a truncated approximation to the HDP and jointly sampling the mode sequence 
using a variant of the forward-backward algorithm. In the case of our switching dynamical systems, we must account 
for the direct correlations in the observations in our likelihood computation. The variant of the forward-backward 
algorithm we use here then involves computing backward messages mt+i t t(zt) oc p(ij>t+l:T\ z ti'&t) 7r > ^) f° r eacn 
Zt 6 {1, . . . , L} with L the chosen truncation level, followed by recursively sampling each z t conditioned on Zt-\ 
from 

p(zt | *t_i,^ 1:Tl 7r,0) <xp(zt | Tr^XVt I ^ t _ 1 ,AW,sC*))m t+ i, t (« t ). (30) 

Joint sampling of the mode sequence is especially important when the observations are directly correlated via a 
dynamical process since this correlation further slows the mixing rate of the sequential sampler of Teh et. al. lfT2l . 
Note that using an order L weak limit approximation to the HDP still encourages the use of a sparse subset of the 
L possible dynamical modes. 

4) Block Sampling x\-t (HDP-SLDS only): Conditioned on the mode sequence z\-t and the set of SLDS 
parameters 9 = {A^ ,^ h \ R}, our dynamical process simplifies to a time-varying linear dynamical system. We can 
then block sample x\ : t by first running a backward Kalman filter to compute mt+ij{xt) oc p(y t+1:T \x t , zt+i-.r, &) 
and then recursively sampling each x t conditioned on Xt-i from 

p(x t | x t -i, y 1:T , zi-t, 0) oc p(x t \ x t -x, A^ Zt \ Y, { - Zt) )p(y t \ x t , R)m t+U (x t ). (31) 
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The messages are given in information form by mt,t-i[&t-l) oc TV 1 (xt-i; #t,t-i, A tit _x), where the information 
parameters are recursively defined as 

t(t _i = A^ T T,-^A t (C T R- 1 y t + tf t+M ) 

A t ,t_i = ^( z *) T s-( 2t )A (2t) - A {zt)1 ^lT (2t) A 4 $r (2t) A^ Zt) , 

with At = (E~( 2 *) + C T R~ l C + Af+i^) -1 . The standard and A^ updated information parameters for a 
backward running Kalman filter are given by 

A* )t = C T R- l C + A t+M 

^| t = C T i2-V + ^t+i,t. (33) 



See [36] for a derivation and for a more numerically stable version of this recursion. 

5) Sequentially Sampling z\-_t (HDP-SLDS only): For the HDP-SLDS, iterating between the previous sampling 
stages can lead to slow mixing rates since the mode sequence is sampled conditioned on a sample of the state se- 
quence. For high-dimensional state spaces M n , this problem is exacerbated. Instead, one can analytically marginalize 
the state sequence and sequentially sample the mode sequence from p(z t \ z\ t ,yi :T ,ir,0). This marginalization is 
accomplished by once again harnessing the fact that conditioned on the mode sequence, our model reduces to a time- 
varying linear dynamical system. When sampling z t and conditioning on the mode sequence at all other time steps, 
we can run a forward Kalman filter to marginalize the state sequence x\-_t~2 producing p{x t -i \ y\-t-i-,Z\-t-i,9), 
and a backward filter to marginalize x t +i-.T producing p(y t +± : T I x t, z t+i-.T, &)• Then, for each possible value of 
Zt, we combine these forward and backward messages with the local likelihood p(y t j x t ) and local dynamic 
pixt | Xt-i,0, Zt = k) and marginalize over x t and Xt-\ resulting in the likelihood of the observation sequence 
y 1:T as a function of z t . This likelihood is combined with the prior probability of transitioning from zt—\ to z t = k 
and from z t = k to Zt+i. The resulting distribution is given by: 

p(z t = k | z\ t ,y 1:T ,7T,e) oc vr Zt _ 1 (/c)vr fe (2; t+ i) 
|Aj 

with 



|A ( fc)+Afe |1/2 exp V*^r + 2^ + V + K\t)~ i^t + *?| t ) } (34. 



(35) 



A f) = (eW+aW^aW 1 )- 1 
# = (s«m^U^ 1h aWY 1 aWa-/ |m <_ 1h . 

See IT361 for full derivations. Here, i% t and A^ are the updated information parameters for a forward running 
Kalman filter, defined recursively as 

A* = C T R- X C + Yr^ - Z~( z t)A( Zt \A^ T Z-^A^ + A^u^A^Y^ 

$1 = C T R- X y t + YT^ A^ Zt \A^ T YT^ A^ + A i / _ 1|t _ 1 )- 1 ^ / _ 1|t _ r (36) 

Note that a sequential node ordering for this sampling step allows for efficient updates to the recursively defined 
filter parameters. However, this sequential sampling is still computationally intensive, so our Gibbs sampler iterates 
between blocked sampling of the state and mode sequences many times before interleaving a sequential mode 
sequence sampling step. 

The resulting Gibbs sampler is outlined in Algorithm Q] 

IV. Results 

A. MNIW prior 

We begin by examining a set of three synthetic datasets displayed in Fig. |2£a) in order to analyze the relative 
modeling power of the HDP-VAR(l)-HMMl HDP-VAR(2)-HMM, and HDP-SLDS using the MNIW prior. We 
compare to a baseline sticky HDP-HMM using first difference observations, imitating a HDP-VAR(1)-HMM with 



6 We use the notation HDP-VAR(r)-HMM to refer to an order r HDP-AR-HMM with vector observations. 



MIT LIDS TECHNICAL REPORT #2830 



10 



Given a previous set of mode-specific transition probabilities 7n n x \ the global transition distribution f3^ n l \ and 
dynamic parameters 9^ n ~ 1 ': 

1) Set 7T = tt^- 1 ), (3 = p^-V, and 6 = fit"" 1 ). 

2) IfHDP-SLDS, 

a) For each t G {1, . . . , T}, compute {t?^, A^ 4 } as in Eq. (l36l ). 

b) For each i G {T, . . . , 1}, 

i) Compute {tfL, A|,J as in Eq. (|33l). 

ii) For each fc G {1, . . . , L}, compute {df \ K^} as in Eq. (l35l ) and set 

h(yi,T) = \^ ] \ 1/2 \^ } + Kt\- 1/2 



cxp 

iii) Sample a mode assignment 



itf«V*>0« + ±(*<*> + (A^ + A^r^W + ^)) . 



L 

Zi ~ ^7r Zt _ 1 (/c)vr fc (2; t+ i)/ fc (^ 1:T )(5(zt,fc). 
fc=i 

c) Working sequentially forward in time sample 

xt ~ A/"(a; t ; (E~(*> + A^-^"^^*-! + ^ |t ), + A^)" 1 ). 

d) Set pseudo-observations tpi-x = xv.T- 

3) If HDP-AR-HMM, set pseudo-observations ip VT = y 1:T . 

4) Block sample given transition distributions 7r, dynamic parameters 0, and pseudo-observations V'hT as 
in Algorithm |2] 

5) Update the global transition distribution (3 (utilizing auxiliary variables m, w, and m), mode-specific 
transition distributions ir^, and hyperparameters a, 7, and k as in lfT4l . 

6) For each G {1, . . . , L}, sample dynamic parameters (A^ k \ Y,^) given the pseudo-observations Vi:T an( ^ 
mode sequence as in Algorithm [3] for the MNIW prior and Algorithm [4] for the ARD prior. 

7) If HDP-SLDS, also sample the measurement noise covariance 

R ~ IW ( / r + r ,^(^-C^)(^-C7^) T + J R o y 



8) Fix vr( n ) = 7T, = p, and (n) = 0. 



Algorithm 1: HDP-SLDS and HDP-AR-HMM Gibbs sampler. 



= I for all k. In Fig. I2b)-(e) we display Hamming distance errors that are calculated by choosing the optimal 
mapping of indices maximizing overlap between the true and estimated mode sequences. 

We place a Gamma(a, b) prior on the sticky HDP-HMM concentration parameters a + k and 7, and a Beta(c, d) 
prior on the self-transition proportion parameter p = k/(o. + k). We choose the weakly informative setting of a = 1, 
b = 0.01, c = 10, and d = 1. The details on setting the MNIW hyperparameters from statistics of the data are 
discussed in the Appendix. 

For the first scenario (Fig. |2] (top)), the data were generated from a five-mode switching VAR(l) process with 
a 0.98 probability of self-transition and equally likely transitions to the other modes. The same mode-transition 
structure was used in the subsequent two scenarios, as well. The three switching linear dynamical models provide 
comparable performance since both the HDP-VAR(2)-HMM and HDP-SLDS with C = I3 contain the class of 
HDP-VAR(l)-HMMs. In the second scenario (Fig. [2] (middle)), the data were generated from a 3-mode switching 
AR(2) process. The HDP-AR(2)-HMM has significantly better performance than the HDP-AR(1)-HMM while the 
performance of the HDP-SLDS with C = [1 0] performs similarly, but has greater posterior variability because the 
HDP-AR(2)-HMM model family is smaller. Note that the HDP-SLDS sampler is slower to mix since the hidden, 
continuous state is also sampled. The data in the third scenario (Fig. [2] (bottom)) were generated from a three-mode 
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Given mode-specific transition probabilities tv, dynamic parameters 6, and pseudo-observations Vi-t : 
1) Calculate messages mt.t-i(k), initialized to mr+\,T{k) = 1> and the sample mode sequence zi.t- 

a) For each t € {T, . . . , 1} and k E {1, . . . , L}, compute 

m t>t -i(k) = ^k(jW Ut^A^^^A m t+1 , t (j) 

j=l V i=l J 

b) Working sequentially forward in time, starting with transitions counts = 0: 
i) For each k E {1, . . . , L}, compute the probability 

CO. 



i=i 



ii) Sample a mode assignment z t as follows and increment n Zt _ lZt : 

L 



z t ~ ^7r 2t _ 1 (/c)/ fc (V' t )5(z t ,A;) 



fc=l 



Note that the likelihoods can be precomputed for each k £ {1, . . . , L}. 



Algorithm 2: Blocked mode-sequence sampler for HDP-AR-HMM or HDP-SLDS. 



Given pseudo-observations t/'lt an ^ mode sequence z\-t, for each k G {1, . . . ,i^}: 

1) Construct *W and *W as in Eq. (115b - 

2) Compute sufficient statistics using pseudo-observations ■j/'j associated with z t = k: 

S f T = + K S (k) T = *W*( fe ) T + MK S ( S = *W*W T + MKM T . 

3) Sample dynamic parameters: 



EW ~ IW (n fc + no, S« + Sb) A« | E« ~ MAA (a«; S«s4*\ £« S 



(fc) 



Algorithm 3: Parameter sampling using MNIW prior. 



SLDS model with C = I 3 . Here, we clearly see that neither the HDP-VAR(1)-HMM nor HDP-VAR(2)-HMM is 
equivalent to the HDP-SLDS. Note that all of the switching models yielded significant improvements relative to the 
baseline sticky HDP-HMM. This input representation is more effective than using raw observations for HDP-HMM 
learning, but still much less effective than richer models which switch among learned LDS. Together, these results 
demonstrate both the differences between our models as well as the models' ability to learn switching processes 
with varying numbers of modes. 



B. ARD prior 

We now compare the utility of the ARD prior to the MNIW prior using the HDP-SLDS model when the true 
underlying dynamical modes have sparse dependencies relative to the assumed model ordei0. We generated data 
from a two-mode SLDS with 0.98 probability of self-transition and 



A (i) 



0.8 
-0.2 




-0.2 
0.8 




A (2) 



-0.2 
0.8 




0.8 
-0.2 




7 That is, the HDP-SLDS may have dynamical regimes reliant on lower state dimensions, or the HDP-AR-HMM may have modes described 
by lower order VAR processes. 
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Given pseudo-observations tpi-j-, mode sequence z\-t, and a previous set of dynamic parameters 
(A(*), £(*),«*<*)), for each k € {1, . . . , K}: 

1) Construct ^> t as in Eq. (I24l 

2) Iterate multiple times between the following steps: 



a) Construct given cr ' as in Eq. ( 1231 ) and sample the dynamic matrix: 

vec(A«)|£« ^-AT^ ^I-i^- (k) ^o {k) + E ^-i^ (fc) ^t-i). 

^t\z t =k t\z t =k ' 

b) For each I € {1, ... , m}, with m = n for the SLDS and m = r for the switching VAR, sample ARD 
precision parameters: 



,(*) 



'/■•) a (fc) /- f 1^1 l ,^(»>i)es« "ij 
a; ; | A 1 ' 8 -' ~ Gamma | a + b H — 2 — - 



c) Compute sufficient statistic: 

g (fc)_ 



t|zt=/c 

and sample process noise covariance: 



S W| A W „ Iw(n fc + no,S^ + 5 



Algorithm 4: Parameter sampling using ARD prior. 




(a) 



(b) 



(c) 



(d) 



(e) 



Fig. 2. (a) Observation sequence (blue, green, red) and associated mode sequence (magenta) for a 5-mode switching VAR(l) process (top), 
3-mode switching AR(2) process (middle), and 3-mode SLDS (bottom). The associated 10th, 50th, and 90th Hamming distance quantiles 
over 100 trials are shown for the (b) HDP-VAR(1)-HMM, (c) HDP-VAR(2)-HMM, (d) HDP-SLDS with C = I (top and bottom) and 
C — [1 0] (middle), and (e) sticky HDP-HMM using first difference observations. 



with C = [I 2 0], EW = s(2) = j 3j and R = I 2 . The first dynamical process can be equivalently described by just 
the first and second state components since the third component is simply white noise that does not contribute to 
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(b) 



(c) 



(d) 



HARD hyper: x1 
HARD hyper: x2 
HARD hyper: x3 



(e) 



Fig. 3. (a) Observation sequence (green, blue) and mode sequence (magenta) of a 2-mode SLDS, where the first mode can be realized 
by the first two state components and the second mode solely by the first. The associated 10th, 50th, and 90th Hamming distance quantiles 
over 100 trials are shown for the (b) MNIW and (c) ARD prior, (d)-(e) Histograms of inferred ARD precisions associated with the first and 
second dynamical modes, respectively, at the 5000th Gibbs iteration. Larger values correspond to non-dynamical components. 



the state dynamics and is not directly (or indirectly) observed. For the second dynamical process, the third state 
component is once again a white noise process, but does contribute to the dynamics of the first and second state 
components. However, we can equivalently represent the dynamics of this mode as 

-0.2 0" 

xi,t = -u.lx ljt -i + ei )t ^ (2) 
x 2 ,t = 0.8xi it _i + e 2>t 



0.8 




where e t is a white noise term defined by the original process noise combined with X3 t t, and A^ 2 ) is the dynamical 
matrix associated with this equivalent representation of the second dynamical mode. Notice that this SLDS does not 
satisfy Criterion 13.11 since the second column of A^ 2 ) is zero while the second column of C is not. Nevertheless, 



(2) 

because the realization is in our canonical form with C = [I2 0], we still expect to recover the a 2 = a^' = 
sparsity structure. We set the parameters of the Gamma(a, 6) prior on the ARD precisions as a = \Sz\ and b = 
a/1000, where we recall the definition of from Eq. (|26l ). This specification fixes the mean of the prior to 1000 
while aiming to provide a prior that is equally informative for various choices of model order (i.e., sizes \Se\). 

In Fig. |3j we see that even in this low-dimensional example, the ARD provides superior mode-sequence estimates, 
as well as a mechanism for identifying non-dynamical state components. The histograms of the inferred aW are 



.( 2 ) 



shown in Fig. [Hd)-(e). From the clear separation between the sampled dynamic range of ai and (a\ , a 2 ), and 

(2) (2) (2) (i) 

between that of (a 2 , a 3 ) and q , we see that we are able to correctly identify dynamical systems with a 3 = 



J 1 ) „Q-h 



and a 



(2) 



a 



(2) 



0. 



C. Dancing Honey Bees 

Honey bees perform a set of dances within the beehive in order to communicate the location of food sources. 
Specifically, they switch between a set of waggle, turn-right, and turn-left dances. During the waggle dance, the 
bee walks roughly in a straight line while rapidly shaking its body from left to right. The turning dances simply 
involve the bee turning in a clockwise or counterclockwise direction. We display six such sequences of honey 
bee dances in Fig. |4] The data consist of measurements y t = [cos(f? t ) sm(9 t ) xt yt] T , where (xt,yt) denotes 
the 2D coordinates of the bee's body and 9 t its head ang led Both Oh et. al. OH and Xuan and Murphy El 
used switching dynamical models to analyze these honey bee dances. We wish to analyze the performance of our 
Bayesian nonparametric variants of these models in segmenting the six sequences into the dance labels displayed 
in Fig. H 

MNIW Prior — Unsupervised: We start by testing the HDP-VAR(1)-HMM using a MNIW prior. (Note that we 
did not see performance gains by considering the HDP-SLDS, so we omit showing results for that architecture.) 
We set the prior distributions on the dynamic parameters and hyperparameters as in Sec. IIV-AI for the synthetic 
data examples, with the MNIW prior based on a pre-processed observation sequence. The pre-processing involves 
centering the position observations around and scaling each component of y t to be within the same dynamic 
range. We compare our results to those of Xuan and Murphy ifTTTl . who used a change -point detection technique for 
inference on this dataset. As shown in Fig. [5jd) and (h), our model achieves a superior segmentation compared to 

8 The data are available at http://www.cc.gatech.edu/ borg/ijcv_psslds/. 
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Fig. 5. (a)-(c) The 10th, 50th, and 90th Hamming distance quantiles over 100 trials are shown for sequences 4, 5, and 6, respectively. 
(e)-(g) Estimated mode sequences representing the median error for sequences 4, 5, and 6 at the 200th Gibbs iteration, with errors indicated 
in red. (d) and (h) ROC curves for the unsupervised HDP-VAR-HMM, partially supervised HDP-VAR-HMM, and change-point formulation 
of II II using the Viterbi sequence for segmenting datasets 1-3 and 4-6, respectively. 



the change -point formulation in almost all cases, while also identifying modes which reoccur over time. Oh et. al. 
ifTOl also presented an analysis of the honey bee data, using an SLDS with a fixed number of modes. Unfortunately, 
that analysis is not directly comparable to ours, because Oh et. al. [10] used their SLDS in a supervised formulation 
in which the ground truth labels for all but one of the sequences are employed in the inference of the labels for 
the remaining held-out sequence, and in which the kernels used in the MCMC procedure depend on the ground 
truth labels. (The authors also considered a "parameterized segmental SLDS (PS-SLDS)," which makes use of 
domain knowledge specific to honey bee dancing and requires additional supervision during the learning process.) 
Nonetheless, in Table [III] we report the performance of these methods as well as the median performance (over 100 
trials) of the unsupervised HDP-VAR(1)-HMM in order to provide a sense of the level of performance achievable 
without detailed, manual supervision. As seen in Table [Till the HDP-VAR(1)-HMM yields very good performance 
on sequences 4 to 6 in terms of the learned segmentation and number of modes (see Fig. [5); the performance 
approaches that of the supervised method. For sequences 1 to 3 — which are much less regular than sequences 4 to 
6 — the performance of the unsupervised procedure is substantially worse. In Fig. |4] we see the extreme variation 
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sequence 


1 


2 


3 


4 


5 


6 


HDP-VAR(1)-HMM unsupervised 


45.0 


42.7 


47.3 


88.1 


92.5 


88.2 


HDP-VAR(1)-HMM partially supervised 


55.0 


86.3 


81.7 


89.0 


92.4 


89.6 


SLDS DD-MCMC 


74.0 


86.1 


81.3 


93.4 


90.2 


90.4 


PS-SLDS DD-MCMC 


75.9 


92.4 


83.1 


93.4 


90.4 


91.0 



TABLE III 

Median label accuracy of the HDP-VAR(1)-HMM using unsupervised and partially supervised Gibbs sampling, 

COMPARED TO ACCURACY OF THE SUPERVISED PS-SLDS AND SLDS PROCEDURES, WHERE THE LATTER ALGORITHMS WERE BASED 

ON A SUPERVISED MCMC PROCEDURE (DD-MCMC) ifTOll . 
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Fig. 6. For an order 1, 2, and 7 HDP-AR-HMM with a MNIW prior and an order 7 HDP-AR-HMM with an ARD prior, we plot the 
shortest intervals containing 95% of the held-out log-likelihoods calculated based on a set of Gibbs samples taken at iteration 1000 from 
100 chains, (a) Log-likelihood of the second half of honey bee dance sequence 4 based on model parameters inferred from the first half of 
the sequence, (b)-(c) Similarly for sequences 5 and 6, respectively. 



in head angle during the waggle dances of sequences 1 to 3@ As noted by Oh, the tracking results based on the 
vision-based tracker are noisier for these sequences and the patterns of switching between dance modes is more 
irregular. This dramatically affects our performance since we do not use domain-specific information. Indeed, our 
learned segmentations consistently identify turn-right and turn-left modes, but often create a new, sequence-specific 
waggle dance mode. Many of our errors can be attributed to creating multiple waggle dance modes within a 
sequence. Overall, however, we are able to achieve reasonably good segmentations without having to manually 
input domain-specific knowledge. 

MNIW Prior — Partially Supervised: The discrepancy in performance between our results and the supervised 
approach of Oh et. al. 1 10] motivated us to also consider a partially supervised variant of the HDP-VAR(1)-HMM 
in which we fix the ground truth mode sequences for five out of six of the sequences, and jointly infer both a 
combined set of dynamic parameters and the left-out mode sequence. This is equivalent to informing the prior 
distributions with the data from the five fixed sequences, and using these updated posterior distributions as the 
prior distributions for the held-out sequence. As we see in Table ITTT1 this partially supervised approach considerably 
improves performance for these three sequences, especially sequences 2 and 3. Here, we hand-aligned sequences so 
that the waggle dances tended to have head angle measurements centered about tt/2 radians. Aligning the waggle 
dances is possible by looking at the high frequency portions of the head angle measurements. Additionally, the 
pre-processing of the unsupervised approach is not appropriate here as the scalings and shiftings are dance-specific, 
and such transformations modify the associated switching VAR(l) model. Instead, to account for the varying frames 
of reference (i.e., point of origin for each bee body) we allowed for a mean on the process noise, and placed 
an independent J\f(0, So) prior on this parameter. See the Appendix for details on how the hyperparameters of 
these prior distributions are set. 

ARD Prior: Using the cleaner sequences 4 to 6, we investigate the affects of the sparsity-inducing ARD prior 
by assuming a higher order switching VAR model and computing the likelihood of the second half of each dance 
sequence based on parameters inferred from Gibbs sampling using the data from the first half of each sequence. 
In Fig. [6l we specifically compare the performance of an HDP-VAR(r)-HMM with a conjugate MNIW prior 
for r = 1,2,7 to that of an HDP-VAR(7)-HMM with an ARD prior. We use the same approach to setting the 
hyperparameters as in Sec. IIV-BI We see that assuming a higher order model improves the predictive likelihood 



9 From Fig. [4] we also see that even in sequences 4 to 6, the ground truth labeling appear to be inaccurate at times. Specifically, certain 
time steps are labeled as waggle dances (red) that look more typical of a turning dance (green, blue). 
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performance, but only when combined with a regularizing prior (e.g., the ARD) that avoids over-fitting in the 
presence of limited data. Although not depicted here (see instead Il36l0 . the ARD prior also informs us of the 
variable-order nature of this switching dynamical process. When considering an HDP-VAR(2)-HMM with an ARD 
prior, the posterior distribution of the ARD hyperparameters for the first and second order lag components associated 
with each of the three dominant inferred dances clearly indicates that two of the turning dances simply rely on the 
first lag component while the other dance relies on both lag components. To verify these results, we provided the 
data and ground truth labels to MATLAB's lpc implementation of Levinson's algorithm, which indicated that the 
turning dances are well approximated by an order 1 process, while the waggle dance relies on an order 2 model. 
Thus, our learned orders for the three dances match what is indicated by Levinson's algorithm on ground-truth 
segmented data. 

V. Model Variants 

There are many variants of the general SLDS and switching VAR models that are pervasive in the literature. 
One important example is when the dynamic matrix is shared between modes; here, the dynamics are instead 
distinguished based on a switching mean, such as the Markov switching stochastic volatility (MSSV) model. In the 
maneuvering target tracking community, it is often further assumed that the dynamic matrix is shared and known 
(due to the understood physics of the target). We explore both of these variants in the following sections. 

A. Shared Dynamic Matrix, Switching Driving Noise 

In many applications, the dynamics of the switching process can be described by a shared linear dynamical 
system matrix A; the dynamics within a given mode are then determined by some external force acting upon this 
LDS, and it is how this force is exerted that is mode-specific. The general form for such an SLDS is given by 

z t | zt-i ~ vr 2t _ 1 
x t = Ax t -i + et(zt) y t = Cx t + w t , 

with process and measurement noise e t (k) ~ A/"(/z( fc ),£ W ) and w t ~ M(0,R), respectively. In this scenario, the 
data are generated from one dynamic matrix, A, and multiple process noise covariance matrices, Y,( k \ Thus, one 
cannot place a MNIW prior jointly on these parameters (conditioned on fi^) due to the coupling of the parameters 
in this prior. We instead consider independent priors on A, Y<( k \ and We will refer to the choice of a normal 
prior on A, inverse-Wishart prior on Y>( k \ and normal prior on [i^ as the N-IW-N prior. See [36] for details on 
deriving the resulting posterior distributions given these independent priors. 

Stochastic Volatility: An example of an SLDS in a similar form to that of Eq. (|37l ) is the Markov switching 
stochastic volatility (MSSV) model 0, @, EH. The MSSV assumes that the log-volatilities follow an AR(1) 
process with a Markov switching mean. This underlying process is observed via conditionally independent and 
normally distributed daily returns. Specifically, let y t represent, for example, the daily returns of a stock index. The 
state xt is then given the interpretation of log-volatilities and the resulting state space is given by Q 

z t | zt-i ~ vr Zt _ 1 
x t = axt-i + e t (z t ) y t = u t (x t ), 

with et(k) ~ Af(^ k \cr 2 ) and ut(xt) ~ A/"(0, exp(x t )). Here, only the mean of the process noise is mode-specific. 
Note, however, that the measurement equation is non-linear in the state x t . Carvalho and Lopes [7 ] employ a particle 
filtering approach to cope with these non-linearities. In (6], the MSSV is instead modeled in the log-squared-daily- 
returns domain such that 

log(y t 2 ) = x t + w t , (39) 

where w t is additive, non-Gaussian noise. This noise is sometimes approximated by a moment-matched Gaus- 
sian 11391 , while So et. al. use a mixture of Gaussians approximation. The MSSV is then typically bestowed a 
fixed set of two or three regimes of volatility. 

We examine the IBOVESPA stock index (Sao Paulo Stock Exchange) over the period of 01/03/1997 to 01/16/2001, 
during which ten key world events are cited in Q as affecting the emerging Brazilian market during this time period. 
The key world events are summarized in Table IIVI and shown in the plots of Fig. [7] Use of this dataset was motivated 



MIT LIDS TECHNICAL REPORT #2830 



17 



Date 


Event 


07/02/1997 


Thailand devalues the Baht by as much as 20% 


08/11/1997 


IMF and Thailand set a rescue agreement 


10/23/1997 


Hong Kongs stock index falls 10.4%. South Korea won starts to weaken 


12/02/1997 


IMF and South Korea set a bailout agreement 


06/01/1998 


Russias stock market crashes 


06/20/1998 


IMF gives final approval to a loan package to Russia 


08/19/1998 


Russia officially falls into default 


10/09/1998 


IMF and World Bank joint meeting to discuss global economic crisis. The Fed cuts interest rates 


01/15/1999 


The Brazilian government allows its currency, the Real, to float freely by lifting exchange controls 


02/02/1999 


Arminio Fraga is named President of Brazils Central Bank 



TABLE IV 

Table of 10 key world events affecting the IBOVESPA stock index (Sao Paulo Stock Exchange) over the period of 

01/03/1997 to 01/16/2001, as cited by Carvalho and Lopes 0. 

by the work of Carvalho and Lopes [7], in which a two-mode MSSV model is assumed. We consider a variant of 
the HDP-SLDS to match the MSSV model of Eq. (|38l ). Specifically we examine log-squared daily returns, as in 
Eq. (|39l ), and use a DP mixture of Gaussians to model the measurement noise: 

e t (k) ~ A^(m (&) ,EW) 

w t ~ YT=i w*A/"(0, Rt) w ~ GEM(oy), R £ ~ IW(n r , S r ). (40) 

We truncate the measurement noise DP mixture to 10 components. For the HDP concentration hyperparameters, 
a, 7, and k, we use the same prior distributions as in Sec. IIV-AllrV-Cl For the dynamic parameters, we rely on 
the N-IW-N prior described in Sec. IV-AI and once again set the hyperparameters of this prior from statistics of the 
data as described in the Appendix. Since we allow for a mean on the process noise and examine log-squared daily 
returns, we do not preprocess the data. 

The posterior probability of an HDP-SLDS inferred change point is shown in Fig. Ha), and in Fig. |7]T>) we 
display the corresponding plot for a non-sticky variant (i.e., with k = so that there is no bias towards mode 
self-transitions.) The HDP-SLDS is able to infer very similar change points to those presented in 0. Without the 
sticky extension, the non-sticky model variant over-segments the data and rapidly switches between redundant states 
leading to many inferred change points that do not align with any world event. In Fig. He), the overall change-point 
detection performance of the HDP-SLDS is compared to that of the HDP-AR(1)-HMM, HDP-AR(2)-HMM, and 
non-sticky HDP-SLDS. The ROC curves shown are calculated by windowing the time axis and taking the maximum 
probability of a change point in each window. These probabilities are then used as the confidence of a change point 
in that window. From this plot, we clearly see the advantage of using an SLDS model combined with the sticky 
HDP-HMM prior on the mode sequence. 

We also analyzed the performance of an HDP-SLDS as defined in Table U We used raw daily -return observations, 
and first pre-processed the data in the same manner as the honey bee data by centering the observations around 
and scaling the data to be roughly within a [—10,10] dynamic range. We then took a MNIW prior on the 
dynamic parameters, as outlined in the Appendix. Overall, although the state of this HDP-SLDS does not have the 
interpretation of log-volatilities, we see are still able to capture regime-changes in the dynamics of this stock index 
and find changepoints that align better with the true world events than in the MSSV HDP-SLDS model. 

B. Fixed Dynamic Matrix, Switching Driving Noise 

There are some cases in which the dynamical model is well-defined through knowledge of the physics of the 
system being observed, such as simple kinematic motion. More complicated motions can typically be modeled 
using the same fixed dynamical model, but using a more complex description of the driving force. A generic LDS 
driven by an unknown control input u t can be represented as 

x t = Ax t -\ + Bu t + v t y t = Cx t + Du t + w t , (41) 

where v t ~ Af(0, Q) and w t ~ /V(0, R). It is often appropriate to assume D = 0, as we do herein. 
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Day Index Day Index False Alarm Rate 

(d) (e) (f) 

Fig. 7. (a) Plot of the estimated probability of a change point on each day using 3,000 Gibbs samples for a MSSV variant of the HDP-SLDS 
using a shared dynamic matrix and allowing a mean on the mode-specific process noise and a mixture of Gaussian measurement noise model. 
The observations are log-squared dialy return measurements, and the 10 key events are indicated with red lines, (b) Similar plot for the 
non-sticky HDP-SLDS with no bias towards self-transitions, (c) ROC curves for the HDP-SLDS, non-sticky HDP-SLDS, HDP-AR(1)-HMM, 
and HDP-AR(2)-HMM. (d)-(f) Analogous plots for the HDP-SLDS of Table U using raw daily return measurements. 



Maneuvering Target Tracking: Target tracking provides an application domain in which one often assumes that 
the dynamical model is known. One method of describing a maneuvering target is to consider the control input as 
a random process [40]. For example, a jump-mean Markov process ATI yields dynamics described as 

Z t | Zt-l ~ 7T^_ 1 

x t = Ax t -\ + Bu t (z t ) + v t y t = Cx t + w t (42) 
ut(k) ~ A/"(M (fc) , SW) v t ~ Af(0, Q) w t ~ Af(0, R). 

Classical approaches rely on defining a fixed set of dynamical modes and associated transition distributions. The 
state dynamics of Eq. (1421 can be equivalently described as 

x t = Ax t -i + et(zt) (43) 
et(fc) ~ M{B^ k \B^B T + Q). (44) 

This model can be captured by our HDP-SLDS formulation of Eq. (l37l) with a fixed dynamic matrix (e.g., constant 
velocity or constant acceleration models [40]) and mode-specific, non-zero mean process noise. Such a formulation 
was explored in (9| along with experiments that compare the performance to that of standard multiple model 
techniques, demonstrating the flexibility of the Bayesian nonparametric approach. Fox et. al. also present an 
alternative sampling scheme that harnesses the fact that the control input may be much lower-dimensional than the 
state and sequentially block-samples (z t ,u t ) analytically marginalizing over the state sequence x\-t- Note that this 
variant of the HDP-SLDS can be viewed as an extension of the work by Caron et. al. ll42l in which the exogenous 
input is modeled as an independent noise process (i.e., no Markov structure on zt) generated from a DP mixture 
model. 

VI. Conclusion 

In this paper, we have addressed the problem of learning switching linear dynamical models with an unknown 
number of modes for describing complex dynamical phenomena. We presented a Bayesian nonparametric approach 
and demonstrated both the utility and versatility of the developed HDP-SLDS and HDP-AR-HMM on real appli- 
cations. Using the same parameter settings, although different model choices, in one case we are able to learn 
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changes in the volatility of the IBOVESPA stock exchange while in another case we learn segmentations of data 
into waggle, turn-right, and turn-left honey bee dances. We also described a method of applying automatic relevance 
determination (ARD) as a sparsity-inducing prior, leading to flexible and scalable dynamical models that allow for 
identification of variable order structure. We concluded by considering adaptations of the HDP-SLDS to specific 
forms often examined in the literature such as the Markov switching stochastic volatility model and a standard 
multiple model target tracking formulation. 

The batch processing of the Gibbs samplers derived herein may be impractical and offline-training online-tracking 
infeasible for certain applications. Due both to the nonlinear dynamics and uncertainty in model parameters, exact 
recursive estimation is infeasible. One could leverage the conditionally linear dynamics and use Rao-Blackwellized 
particle filtering (RBPF) |43l . However, one challenge is that such particle filters can suffer from a progressively 
impoverished particle representation. 

Overall, the formulation we developed herein represents a flexible, Bayesian nonparametric model for describing 
complex dynamical phenomena and discovering simple underlying temporal structures. 

Appendix 

a) MNIW General Method: For the experiments of Sec. IIV-A1 we set M = and K = I m . This choice centers 
the mass of the prior around stable dynamic matrices while allowing for considerable variability. The inverse-Wishart 
portion is given no = m + 2 degrees of freedom. For the HDP-AR-HMM, the scale matrix So = 0.75S, where 
S = y Y^ilft ~ y){Ut ~ y) T ■ Setting the prior directly from the data can help move the mass of the distribution 
to reasonable values of the parameter space. For an HDP-SLDS with x t G W 1 and y t G R d and n = d, we 
set Sq = 0.675S. We then set the inverse-Wishart prior on the measurement noise, R, to have tq = d + 2 and 
#0 = 0.075S. For n > d, see 1551 . 

b) Partially Supervised Honey Bee Experiments: For the partially supervised experiments of Sec. IIV-CI we 
set So = 0.755Vj- Since we are not shifting and scaling the observations, we set Sq to 0.75 times the empirical 
covariance of the first difference observations. We also use no = 10, making the distribution tighter than in the 
unsupervised case. Examining first differences is appropriate since the bee's dynamics are better approximated as 
a random walk than as i.i.d. observations. Using raw observations in the unsupervised approach creates a larger 
expected covariance matrix making the prior on the dynamic matrix less informative, which is useful in the absence 
of other labeled data. 

c) IBOVESPA Stock Index Experiments: For the HDP-SLDS variant of the MSSV model of Eq. (EH), we rely 
on the N-IW-N prior described in Sec. IV-AI For the dynamic parameter a and process noise mean /j,( k \ we use 
A/"(0, 0.75S) priors. The IW prior on £(*) was given 3 degrees of freedom and an expected value of 0.75S. Finally, 
each component of the mixture-of-Gaussian measurement noise was given an IW prior with 3 degrees of freedom 
and an expected value of 5*7r 2 , which matches with the moment-matching technique of Harvey et. al. 11391 . For the 
HDP-AR(r)-HMM's to which we compare in Fig. |7J we place a zero-mean normal prior on the dynamic parameter 
a with covariance set to the expected noise covariance, which in this case is equal to 0.75 times the empirical 
covariance plus 5 * it 2 . The mean parameter is defined as in the HDP-SLDS. 

For the HDP-SLDS comparison using the model of Table H we use a MNIW prior with M = 0, K = 1, no = 3, 
and Sq = 0.75S. The IW prior on R was given vq = 100 and an expected covariance of 25. Our sampler initializes 
parameters from the prior, and we found it useful to set the prior around large values of R in order to avoid initial 
samples chattering between dynamical regimes caused by the state sequence having to account for the noise in 
the observations. After accounting for the residuals of the data in the posterior distribution, we typically learned 
R « 10. 
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Appendix A 
Dynamic Parameter Posteriors 

In this appendix, we derive the posterior distribution over the dynamic parameters of a switching VAR(r) process 
defined as follows: 



y t =J2A < f t) y t „ i + e t (z t ) e t (k) ~ M(^ k \^) 



(45) 



i=l 



where z t indexes the mode-specific VAR(r) process at time t. Assume that the mode sequence {z\, . . . , zt} is known 
and we wish to compute the posterior distribution of the k th mode's VAR(r) parameters A^ for i = 1, . . . , r and 
Y,( k \ Let {ii, ... ,t nk } = {t\z t = k}. Then, we may write 



2/tx Vt 2 



L 2 



(fc) 



2/ti — 1 l/ta-l 
Vt 2 -2 



Vt n 

Vtn 



+ [<Hy e t2 



e t„ 



We define the following notation for Eq. ([46 



Y( fc ) = + 



(46) 



(47) 



and let = {Y^, Y^}. In the following sections, we consider two possible priors on the dynamic parameter. 
In Appendix A- |A] we assume that is for all k and consider the conjugate matrix-normal inverse-Wishart 
(MNIW) prior for {A^, In Appendix A- E we consider the more general form of Eq. ( |43T ) and take 

independent priors on A^, ^ k \ and ^ k \ 



A. Conjugate Prior — MNIW 

To show conjugacy, we place a MNIW prior on the dynamic parameters {A^, E^} and show that the posterior 
remains MNIW given a set of data from the model of Eq. (|43T ) (assuming = 0). The MNIW prior is given by 
placing a matrix-normal prior MM [A^; M, Y>( k \ K) on A^ given E^: 



(48) 



and an inverse-Wishart prior W(n ,S ) on 



|5 |no/2| E (fc)|-(d+n„+l)/2 / 1 \ 

p(E< fc ») = ^ -L exp --tr(S~( fc )5o) (49) 

where r^(-) is the multivariate gamma function and denotes (I?^) -1 for some matrix B. 

We first analyze the likelihood of the data, T)( k \ given the k th mode's dynamic parameters, {A^ k \ T,^}. 
Starting with the fact that each observation vector, y t , is conditionally Gaussian given the lag observations, y t = 
[Vt-i ■ ■ ■ Vt-r} T > we have 

p(D<*>|A« £«) = |27rS( i )|nt/2 exp f-~I>^ -A( fe )y ti ) T E-W(y ti - A(\)j 

= ^(YW;A W Y (fc) ,sW,j). (50) 
To derive the posterior of the dynamic parameters, it is useful to first compute 

p(DW,AW I = p(D^ | A {k \ E( fc ))p(A^ | E^). (51) 
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Using the fact that both the likelihood p(D^ | A^,^) and the prior p(A^ | are matrix-normally 

distributed sharing a common parameter E", we have 

logp(D«,A^ | S^) + C 

= - l -tr{(Y^ - A ( fc )YW) T S-W(Y( fc ) - A^yW) + (A« - M) T S~W(A( fc ) - M)K) 

= -itr(E-W{A( fe )sf AW T - 2Sg } A( fc ) T + S«}) 

= -^r(S-( fe ){(A( fe ) - S^Srf ) S ft)(A( fc ) - S#S#>) T + sgj}), (52) 
where we have used the definitions: 

r _ 1-- 1 \ K \ d ' 2 oW = O(fc) _ qWq-WqW T 

s | 2vr s(fc)|n fc /2 |2 7r 5]( fe )| rn '=/ 2 »l« ra > 

S<£ = Y^Y^ + K S$ = Y^Y^ + MK S« = Y«yW T + MKM T . 
Conditioning on the noise covariance Y<( k \ we see that the dynamic matrix posterior is given by: 

p(A« | DW,SW) oc exp (-i*r((A« - S«S^) t E-«(AW - S#S^)S$)) 

= A4Ar(A«;SWsrf,S«,sg). (53) 

Marginalizing Eq. d52l over the dynamic matrix A^ k \ we derive 

p( D ( fe ) | £«) = / p(D( fe ), A« | S«)dA« 
Jaw 

= , lK }*'* „ exp f-itr(S-( fc )S ( P)') / _1 AtArfAWjSfflS^.EW.Sffi) dA« 
|27rS( fc )| n "/ 2 V 2 rAy J Jaw \S^\ d / 2 ^ ' 

■ (<x P ( --/,'(E-"=;s;p], (54) 



|27rX( fe )|W 2 |sgV /2 



which leads us to our final result of the covariance having an inverse-Wishart marginal posterior distribution 
| dW) oc p(DW I EW)p(£W) 

|if | d / 2 f-Itr(S-«S ( fl)l |S( fc )|-( d+ ^ +1 )/ 2 exp f-Iir(£-(% 

|27rSW|W2| S gV /2 V 2 1 ^V' ' \ 2 1 

oc |s(fc)|-(*+n*+«o+i)/2 exp ^_I tr (£-«( S « + So))) 

= IW(n fc + no,Sjg + Sb). (55) 

5. Non-Conjugate Independent Priors on A^ k \ T,^ k \ and 

In this section, we provide the derivations for the posterior distributions of A^ k \ Y>( k \ and fi^ when each 
of these parameters is given an independent prior. One example of a non-conjugate prior is our proposed ARD 
sparsity-inducing prior. 
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a) Normal Prior on A^ k \- Assume we place a Gaussian prior, J\f(fi A , T, A ), on the vectorization of the matrix 
A™, which we denote by vec(A( fe )). To examine the posterior distribution, we first aim to write the data as a 
linear function of vec(A( fc )). We may rewrite Eq. d45l as 



y t = A^[yU yj_ 



yfj +e t Mt\z t = k 



(56) 



Recalling that r is the autoregressive order and d the dimension of the observation vector y t , we can equivalently 
represent the above as 

.(*) 



Vt 



yt,i yt,2 
o o 







Vt,d*r o 

yt,i yt,2 







y~t,d* r o o 

Vt,l Vt,2 



!Jt, 



M 

l l,2 



,(*) 
l l,d*r 
(k) 

(k) 

a 2,2 



(k) 
a d,d*r 



+ e t (k) 



[ Vt,lh Vt,2h • • ■ Vt,d*rld ] vec(A^) + e t {k) ± Y t vec{A) + e t {k). 



(57) 



Here, the columns of y t are permutations of those of the matrix in the first line such that we may write y t as a 
function of vec(AW). Noting that e t (k) ~ Af{ti (k) , E (fe) ), 

logp(D^, A^ | S( fc ),/i( fe )) 

= C ~ \ E (Vt - V (k) - ^vec(A( fc ))) T S-W(y 4 - n {k) ~ Y t vec(A^)) 



t\z t =k 



-(vec(A( fc )) - m A ) T S A 1 (vec(A( fc ^ 



m A ) 



(58) 



which can be rewritten as, 



logp(D« A« | ^ k \n {k) ) = C- ivec(A( fc )f ( S A X + £ yJ^y] V ec(A^) 

\ t\z t =k J 

+ vec(A«f U/m A+ £ Y^{y t -^) 

\ t\z t =k f 

- \m\^m A - \ Y: (y t - ^ k) ) T ^ (k \y t - ^ k) ) 

t\z t =k 

Conditioning on the data, we arrive at the desired posterior distribution 

logp(A( fc ) | DW,£W,/x(*)) = C- -fvec(A( fc )) T (S A 1 + ^ Y T t YT^Y t )vzz{A^) 



(59) 



t\z t =k 



-2vec(A( fc )f(S A 1 m A + £ Y T t ^ k \y t -^)) 

t\z t =k 



t\z t =k t\z t =k 



(60) 
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b) Inverse Wishart Prior on : We place an inverse-Wishart prior, IW(no, So), on Let = \{t\zt = 
k,t = 1, 2, . . . , T}\. Conditioned on A^ and /u (fe) , standard conjugacy results imply that the posterior of Y,^ is: 

p(E « | D W A W „<*>) = IW L + n , 5 + £ (y t - A«fc - ^)(y t - A^y t - ^f) • (61) 

\ t\z t =k J 

c) Normal Prior on /i^': Finally, we place a Gaussian prior, J\f(n ,Y<o), on fj,^. Conditioned on A( fe ) and 
the posterior of is: 

|D«AW AT 1 j/i( fc );S -Vo + S-( fc) £ (y t - A^y,), S^ 1 + n,S-( fc ) ] . (62) 

We iterate between sampling A( fc \ £( fc ), and many times before moving on to the next step of the Gibbs 
sampler. 

Appendix B 

Sparsity-Inducing Priors for Inferring Variable Order Models 

Recall Sec. IIII-A2I and the proposed automatic relevance determination (ARD) prior for inferring non-dynamical 
components of the state vector in the case of the HDP-SLDS or lag components in the HDP-AR-HMM by shrinking 
components of the model parameters to zero. However, if we would like to ensure that our choice of C = [Id 0] does 
not interfere with learning a sparse realization if one exists, we must restrict ourselves to considered a constrained 
class of dynamical phenomenon. For example, imagine a realization of an LDS with 



A 

Then, the transformation to C = [1 0] leads to 

A = T~ l AT 



0.8 
0.2 



C=[l 1] 



0.5 1 
0.15 0.3 



for 



0.5 
0.5 



So, for this example, fixing C = [1 0] would not lead to learning a sparse dynamical matrix A. Criterion 13.11 
provides a set of sufficient, though not necessary, conditions for maintaining the sparsity within each when 
transforming to the realization with C = [Id 0]. That is, given there exists a realization TZ\ of our dynamical 
phenomena that satisfies Criterion 13.11 the transformation T to an equivalent realization 7^2 with C = [Id 0] will 
maintain the sparsity structure seen in TZ\, which we aim to infer with the ARD prior. Criterion 13.11 which states 
that the observed state vector components are a subset of those relevant to all modes, is reasonable for many 
applications: we often have observations only of components of the state vector that are essential to all modes 
while some modes may have additional components that affect the dynamics, but are not directly observed. 
To clarify the conditions of Criterion 13.11 consider a 3-mode SLDS realization 1Z with 



A (D 



a 



1 a 2 



"3 

(3) 



a 





(3) 



A (2) 



a 



(2) „(2) 



a 



a 



(2) 







a 



a 



(3) 



a 



(3) 



(63) 



then the observation matrix must be of the form C = [ci C2 0] to satisfy Criterion 13.11 



Appendix C 

HDP-SLDS AND HDP-AR-HMM MESSAGE PASSING 

In this appendix, we explore the computation of the backwards message passing and forward sampling scheme 
used for generating samples of the mode sequence z\-t and state sequence x\-t- 
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A. Mode Sequence Message Passing for Blocked Sampling 

Consider a switching VAR(r) process. To derive the forward-backward procedure for jointly sampling the mode 
sequence z\-jt given observations plus r initial observations y 1 _ r . , we first note that the chain rule and 

Markov structure allows us to decompose the joint distribution as follows: 

P(Z1:T I Vl-r-.T^ = P{ Z T \ Z T -1 , Vl-r-.T, ™, 0)p(z T -l | ZT-2, Vl-r-.T, ^,0) 

■■■p(z 2 | z 1 ,y 1 _ r , T ,7T,6)p(z 1 I i/i_ r:T ,7r,0). (64) 

Thus, we may first sample z\ from p{z\ \ y 1 _ r . T ,7v,0), then condition on this value to sample z 2 from p(z 2 \ 
zi,y 1 _ r . T ,7v,6), and so on. The conditional distribution of z\ is derived as: 



P( z l I Vl-r-.T, f,0) °LP(Z1)P(V1 I ^Zi,yi-no)^2Yl P ^ Zt ' ^^-MVt I z t ,V t -r:t-l) 

Z2-.T t 

ccp{z 1 )p(y 1 | 9 Zl ,y 1 _ r . )^2p(z 2 | tTz 1 )p(v 2 | Q z ^y 2 _ r , x )m^ 2 (z 2 ) 



(xp(z 1 )p(y 1 | 6» Zl ,y 1 _ r:0 )m 2 ,i(2;i), (65) 
where m tj t_i(z t _i) is the backward message passed from z t to zt-i and is recursively defined by: 

™ W / I ^-i)p(»t I ^»Vt-r:t-l)»"t+l I t(«t)> *< T ! 

mt,t-i\zt-i) oc | j ' Coo) 
The general conditional distribution of z t is: 

P0t I z t -x,y 1 -. r .. T ,'K,0) ccp(z t | ■K Zt _ 1 )p(y t | e Zt ,y t _ r . t _ 1 )m t+ i,t{zt)- (67) 
For the HDP-AR-HMM, these distributions are given by: 

r 

p{z t = k | Zt-ljI/l-nT,*"^) ^-iW^d/^X) ^ i yt-i^^j^+iptW 

i=l 

L r 
j=l i=l 

m T+1T (k) = l k = l,...,L. (68) 



B. State Sequence Message Passing for Blocked Sampling 

A similar sampling scheme is used for generating samples of the state sequence xi-t- Although we now have a 
continuous state space, the computation of the backwards messages m t +i : t(x t ) is still analytically feasible since we 
are working with Gaussian densities. Assume, mt+i : t(xt) oc N~ 1 (x t ; Ot+i,t, At+i t), where N~ 1 {x] 9, A) denotes a 
Gaussian distribution on x in information form with mean fj, = Ar l 6 and covariance S = A -1 . Given a fixed mode 
sequence Z\ : t, we simply have a time-varying linear dynamic system. The backwards messages for the HDP-SLDS 
can be recursively defined by 



m U -\{Xt-i) oc / p(xt\xt-i,z t )p(y t \xt)mt +1) t(xt)dx t . 



For this model, the state transition density of Eq. (1691 ) can be expressed as 



p(x t \x t -i, zt) oc exp l~{xt - A^xt-i - ^) T ^\x t - A^x t - X - ^)} 



(69) 



(70) 



oc exp 





T r 


X t -1 




x t 





^4(«t) T j]-(2t)^(z{) _ J 4(2t) T s^( 2:t ) 



+ 









T r 


' X t -1 ' 




x t 





Xt-1 

x t 
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We can similarly write the likelihood in exponentiated quadratic form 

p(y t \xt) oc exp J - -(y t - Cx t ) T R~ l {y t - Cx t ) 



(71) 



oc exp 
as well as the messages 





T r 


X t -1 




x t 







o cFr- x c 



Xt-1 

x t 



+ 



Xt-1 

x t 



T r 





C T R- l y t 



m t+ i,t{x t ) oc exp \ - -x t A t+ i it x t + x t 9 t+ i it 



(72) 



x t -i 
x t 



oc exp 

The product of these quadratics is given by: 

p(x t \x t -i, z t )p(y t \xt)m t+ i,t(xt) oc 





A t +i,t 



x t -i 
x t 



+ 



Xt-1 

x t 





h+i,t 



exp 



H 


X t -1 


T r 




x t 





+ 



A^ 1 TT^ A —A^tV-^-izt) 

A^ E~W + C T R- l C + A t+M 
_ J 4( 2 t) T j]^( 2t )^( z t) 
C T R- l y t + E-(*V*) + t+l t _ 

Using standard Gaussian marginalization identities we integrate over x t to get, 

m t ,t-i (x t -i) oc N~ x (x t -i ; 6t,t-i, A t ,t-i) , 

where, 



x t -i 
x t 





T r 


Xt-1 




x t 





(73) 



(74) 



? tt _! = _A^ T £-^V (2t) + A^E-^E"^ + C T R~ l C + A t+ i it ) ^(C^irVt + E~^V (2t) + Qt+i,t) 



At, t _i = ^fe) T S -fe) A (^) _ A (z t ) T s -( Zt )( S -fe) + c T RT l C + A^^E^A^. 

The backwards message passing recursion is initialized with rriT+i,T ~ A/" _1 (:bt;0,0). Let, 

C7 T i?- 1 C7 + A t+1 , t 
C T R- 1 y t + . 



(75) 



(76) 



y t+i,t- 



Then we can define the following recursion, which we note is equivalent to a backwards running Kalman filter in 
information form, 

A b_ i|i _ i = c T R- x C + A^ T T,-^A^ - A^ 1 '£-(*)(£-(*) + C T R- X C + A^^E^A^ 
= C T RT X C + A^ T,- {zt) A {zt) - AWjr^E-W + AjJ-^-WilW 

^ b _ 1|t _ 1 = C T R- l y t _ x - A^E^'V^ + A^E-W + C T R- X C + A^)" 1 

• {C T R7 x y t + E-(*V*) + %i, t ) 
= C T R7 x y t _ x - A^E^'V^ + AWrW^W + A*^) -1 ^^ + E~W/x W ) 



We initialize at time T with 



Ar£\rj\ C R C 

yTd-1. 



(77) 



^t|t — C R yj< 

An equivalent, but more numerically stable recursion is summarized in Algorithm [5] 

After computing the messages mt+i,t{xt) backwards in time, we sample the state sequence x\-t working forwards 
in time. As with the discrete mode sequence, one can decompose the posterior distribution of the state sequence as 



P{xi-.T | yi :T ,Zl: T ,0) = p(x T \ Xt-1, Vl-.Ti z l:T, 0)p(.&T-l \ &T-2, 3/l : T> Z 1:T, 0) 

■ ■■p(x 2 | x 1 ,y 1 . T ,z 1 . T ,0)p(x 1 | y 1:T ,z 1:T ,0). 



(78) 
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1) Initialize filter with 



Q h = C T TC x y T 



2) Working backwards in time, for each t & {T — 1,...,1}: 
a) Compute 

j t+1 = Al +Mt+ M +llt+1 + E-^y 

Lt+i = I — Jt+i- 



b) Predict 



c) Update 



3) Set 



A m , t = A^ T (Lt + iA b t+llt+1 LJ +1 + J t+1 ^ z ^jJ +l )A {zt+l) 
-U = a^ t L t+1 {6 b t+Mt+1 - Atn m /^ +l) ) 



K\ t = A t+1)t + C T R- l C 
6 t \t = 6t+i,t + C T R 1 y t 



A o|o - A i,o 

#0|0 = ^1,0 



Algorithm 5: Numerically stable form of the backwards Kalman information filter. 



where 

p(x t | x t - 1 ,y 1:T ,zi.. T ,0) ocp{x t | x t ^ u A^ Zt) ,^ Zt \^ Zt) )p{y t \ x t ,R)mt+i,t(x t ). (79) 
For the HDP-SLDS, the product of these distributions is equivalent to 

p(x t | Xt-i,y 1 . T ,Zx: T ,9) (x M(x t ; A^xt^ + ^ Zt \j: (z ^)M(y t ]Cx t , R)m t+ht (x t ) 

oc AT(xf,A^Xt-i + V (zt) , ^ Zt) )M-\x t ; 9\ t , k\ t ) 
oc M-\x t - V-M(AMxt- X + + 9\ v E-M + A t 6 |t ), (80) 

which is a simple Gaussian distribution so that the normalization constant is easily computed. Specifically, for each 
t € {1, . . . , T} we sample x t from 

xt ~ M{x t - (E-M + AZ^p-MAMxt-! + £-<*y*> + e b t\tl + A^)" 1 ). (81) 

C. Mode Sequence Message Passing for Sequential Sampling 

A similar sampling scheme to Carter and Kohn 11371 is used for generating samples of the mode sequence z\-t 
having marginalized over the state sequence x\ : t. Specifically, we sample z t from: 

p(z t = k | z\ t , y 1:T , 7r, 0) oc p(z t = k j z\ t ,7v)p(y 1:T \ z t = k, z\ t ) 

oc n Zt _ 1 (k)n k (zt+i)p(yi. T I zt = k,z\ t ). (82) 



MIT LIDS TECHNICAL REPORT #2830 



29 



We omit the dependency on 7v and 6 for compactness. To compute the likelihood for each z t , we combine forward 
and backward messages along with the local dynamics and measurements as follows: 



(83) 

(84) 

(85) 

(86) 
(87) 



p(Vi:T I z t = k,z\ t ) oc / / m t -2,t-i(xt-i)p(y t -i I x t -i)p(x t \x t -i,z t = k) 

p(y t \x t )m t+1)t {xt)dx t dxt-i 

oc / m t -2,t-i(xt~i)p(y t -i I x t -i)p(xt\x t -i,zt = k)dx t - 

p(y t \x t )m t+lt t(xt)dx t , 
where the backwards messages are defined as in Appendix iBl and the forward messages by: 



m t -i t t{xt) oc 



/ p{x t \xt~i,zt)p{y t _ l \xt-i)m t -2,t-i{xt-i)dxt- 

JXt-i 



To derive the forward message passing recursions, assume that 

m t -2,t-l{xt-i) oc J\f~ 1 (x t -i;9t-2,t-i, A t - 2 ,t-i) 
and zt is known. The terms of the integrand of Eq. (l85l) can be written as: 

p(x t \x t -i, zt) = M(x t - A^xt-i + SW) 

£j-(*t) _£-(**) 
. J 4(^t) T 5]-(^) yi(2*) T x-(^)^4>t) 



oc exp 



{-5 




T r 









x t 

X t -1 



+ 



x f 

Xt-1 



.^(2t) T 5]-(zt)„(«t) 



m- t „ 2 ,t-i(«t-i)?'(y t -il^-i) a^t^iV^-illlt-l^Hlt-l 



5/ 



oc cxp 



{4 

















A f-1]*-1 



x t 

X t -1 



+ 



x t 

X t -1 





*— 1 1— 1 



where, similar to the backwards recursions, we have made the following definitions 



€ t = d t-i,t + C T R- l y t 



A( lt = K t -i,t + C T RT l C. 
Combining these distributions and integrating over x t -i, we have 

m t ~i,t( x t) oc N~ l {x t ; Ot-i,t,A t -i t t) 

with 

Q t _ l t = E-^V**) + S-W#)(AW T S-W4W + A^_ 1 , t _ 1 )~ 1 (^_ 1 | t _ 1 - ,4^ T £ 
or equivalently, 

flt-W = A«_ llt (M<*> + aWA^^^O 
A t _i lt = (£W + A^K-l nt _ x A^ T )-\ 
Assuming xq ~ A/"(0, Pq)> we initialize at time f = to 



-1,0 



A_ 



i,o - -P 



(88) 



(89) 



(90) 



(91) 



(92) 



An equivalent, but more numerically stable recursion is summarized in Algorithm [6] However, this algorithm relies 
on the dynamic matrix being invertible. 
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1) Initialize filter with 

A 0|0 = p o 
9 0\0 = 

2) Working forwards in time, for each t € {1, . . . , T}: 

a) Compute 

M t = A-( Zt +^ T A~ f A-( z *+^ 

J t = M t (M t + X-^)- 1 
L t = I- Jt- 

b) Predict 

A t _ ljt = Lt^Mt^Lf^ + J t ^-^jf_ x 

c) Update 

A{ ]t = A t _i, t + C^iT 1 *? 
Algorithm 6: Numerically stable form of the forward Kalman information filter. 



We now return to the computation of the likelihood of Eq. (I84I ). We note that the integral over x t -\ is equivalent 
to computing the message m t -i t t(xt) using z t = k. However, we have to be careful that any constants that were 
previously ignored in this message passing are not a function of z t . For the meantime, let us assume that there 
exists such a constant and let us denote this special message by 

mt-x t t(x t ; z t ) oc c(z t )A/" _1 (x t ; e t -i,t(z t ), A t _i )t (zt)). (93) 

Then, the likelihood can be written as 



p(y 1:T | z t = k, z\ t ) oc / m t -i,t(x t ; z t = k)p(y t \xt)mt+i,t(xt)dx t (94) 

J X t 

<x f c(k)M- l (xf, Ot-^ik), A t ^ t (k))M- l {x t -6\ v A\ t )dx t (95) 



'Xt 

ix t 

Combining the information parameters, and maintaining the term in the normalizing constant that is a function of 
k, this is equivalent to 



p(y 1:T | z t = k,z v ) <x c(AO|At-i,t(fc)| 1/2 exp (-~e^ 1)t {k) T A t -iA k )~ le t-iA k ) 

exp (-~xT(A t - ltt (k) + A b t]t )xt + xf(e t -it(k) + 9 b t]t )j dx t (96) 
To compute this integral, we write the integrand in terms of a Gaussian distribution times a constant. The integral 
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is then simply that constant term: 



p(y 1:T | z t = k,z\ t ) oc c(k)\At-i,t(k)\^ 2 exp (-^9t-i,t(kf ' k t - l>t {k)- l 6 t -i,t{k) 

|A t _i, t (fc) + A t 6 |t r 1 / 2 ex P (±(e t -i,m + e\ t ) T {k t _ u {k) + K\ t r\e t ^ t {k) + o\ t )) 

! Af- l (x t ;e t ^, t (k) + e b tlt ,A t ^ t (k) + AL)dx t 

JXt 



|A t _ l5t (fc) + A^|V2 
exp f - ^ t _ 1 , t (fc) T A t _ llt (fc)- 1 e t _i, t (fe) 

+ |(^_i, t (Ar) + ^ |t ) T (A i _ 1 , t (fc) + A^)- 1 ^-!,*^) + 6>» |t ) 
Thus, 

= i z v , yi:T ,7v,e) K ^_^k(zt +1 )c(k)\4 k) \ 1/2 \4 k) + A '| t r 1/2 

exp (-iflW'ATWflj*) + + flj |t f (Aj« + A^r^f + ^)) (97) 

We now show that c(z t ) is not a function z t . The only place where the previously ignored dependency on z t 
arises is from p(x t \ x t -i,zt)- Namely, 

. , , exp(-i/^*) T ir( Zt V^)) 
p(x t | x t -i, z t ) = | S (z t )|i/2 exponential 

= c\(z t ) ■ exponential (98) 



where exponential is the exponentiated quadratic of Eq. (I87I ). Then, when compute the message mt-i t(xt;zt) 
we update the previous message m t -2,t-i(xt-i) by incorporating the local likelihood p(y t -i \ Xt-i) an d then 
propagating the state estimate with p{x t \ x t -i, zt) and integrating over x t -\. Namely, we combine the distribution 
of Eq. (|98l ) with the exponentiated quadratic of Eq. (I88T ) and integrate over x t -\. 



mt-i t t(xt; zt) oc ci(z t ) / exponential • exponential 2 (ia; t _i, (99) 

JXt-i 



where exponential is the exponentiated quadratic of Eq. 

Since m t -2,t-i(xt-i) oc p{x t -i \ j/ 1:t _ 2 , zv.t-i), and the Markov properties of the state space model dictate 

p{x t -l | yi;t-l,Zl:t-l) = P(yt-l\ x t-l)p(xt-l I l/lrf-2> Zlit-l) 

oc p(y t _ 1 \x t -i)'mt-2 ) t-i(xt-i), (100) 

then 

p(x t -i | 2/i :t _ l5 zi:t-\) = c 2 • exponential. 

We note that the normalizing constant c 2 is not a function of zt since we have only considered z T for r < t. 

Once again exploiting the conditional independencies induced by the Markov structure of our state space model, 
and plugging in Eq. (|98l ) and Eq. (HOIK 

p(x t ,X t -l | yi: t -l,Zl:t) =p(x t -l \ X t -\ , Z t )p{x t -\ \ y V .t-\ , ^lrf-l) 

= (ci(zj) • exponential )(c2 • exponential) 

= c\(zt)c2 ■ exponential • exponential . (101) 
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Plugging this results into Eq. d99l ), we have 

m t -i,t(x t ; z t ) oc ci(z t ) 



x t . x ci(z t )c 2 



p(x t ,X t -l I yi:t-l,Zl:t)dx t -l 



oc — p(x t I yi;t-i,zi it ). 

C2 



(102) 



(103) 



Comparing Eq. dl02| ) to Eq. d93t , and noting that 

p(x t | y 1:t _ l5 zi :t ) = N~ x {x t ; t -i t t(z t ), A t -i )t (zt)), 

we see that c(zj) = ~ and is thus not a function of zt- 
Algebraically, we could derive this result as follows. 

iTT't-i,t( x t', z t) oc ci{z t ) I exponential • exponential 2 <i:Ei_i 
■fx, : 

= Cl (z t )c 3 (z t ) f m(\ Xt x l 1 ;A(z t )- 1 0(z 4 ),A(z 4 )') dsc^, 

where 6(z t ) and A(z t ) are the information parameters determined by combining the functional forms of exponential 
and exponential, and 

. . , . exp{-i/i^) T S-^)/x^)}e X p{i0(z t ) T A(zi)- 1 0(zO} 

Cl(zt)c3(Zi) = WW 2 \K{^ • 

Computing these terms in parts, and using standard linear algebra properties of block matrices, 

|A(^)| = ^-^WiA^Y.-^A^ +Af_ 1|t _ 1 ) -AW t E~WaW| 

— l^~( 2t )|IA^ 
_ \ n \\ 2 h-i\t-v 



(104) 



(105) 



A(^)" 1 



(£-(*) _ ^-^)A^k{z t y l A^ T ^-^)" 1 



A ( - Zt ^A f 

^ lv t-i\t-i 



A t-i\t-i A 



(A^j-aW'e-^'UW)- 1 



T/i "t-lt-l t-lt-1 



a/ 4( 2 *) t 



A _/ 



(106) 



where A(z t ) = (A^^T, ( 2f ) J 4( 2 *) + ^t-i\t-l^ anc ^ we nave usec ^ tne matrix inversion lemma in obtaining the last 
equality. Using this form of A(zt) , we readily obtain 



0(z t fA(z t )-'e(z t ) = + Cilt-^-ilt-i^ilt-r 



Thus, 



ci(z 4 )c 3 (^) 

which does not depend upon the value of zt- 



ex P{h^t-l\t-l^t-l\t-l^t-i\t-l^ 
|A / , IV2 



(107) 



(108) 



Appendix D 

Derivation of Maneuvering Target Tracking Sampler 

In this Appendix we derive the maneuvering target tracking (MTT) sampler outlined in Sec. IV-BI Recall the 
MTT model of Eq. (l42l . As described in Sec. IV-B[ we are interested in jointly sampling the control input and 
dynamical mode (ut, Zt), marginalizing over the state sequence x\-.t, the transition distributions it, and the dynamic 
parameters 6 = {^ k \ One can factor the desired conditional distribution factorizes as, 

p(u t , zt\z\ t , u\ t ,y 1:T , P, a, k, A) = p(z t \z\ t ,u\ t , y VT , /3, a, k, \)p(u t \zi :T , u\ t ,y 1:T , A). (109) 
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The distribution in Eq. dl09l) is a hybrid distribution: each discrete value of the dynamical mode indicator variable 
Zt corresponds to a different continuous distribution on the control input u t . We analyze each of the conditional 
distributions of Eq. ( 11091 ) by considering the joint distribution on all of the model parameters, and then marginalizing 
Xi-_T, 7r, and 8 k . (Note that marginalization over 9j for j ^ k simply results in a constant.) 

p[z t = k\z\t,u\ t ,y 1:T ,0,a,K,X) oc / 17^(^1/3, a, K)Y\p(z T \Tv z ^_ 1 )d'K 

Jtv j 

/ p{dk\X) IT p(u T \e k )d6 k T\p(x T \x T - 1 ,u T )p(y T \x T )dx 1 : T du t . (110) 

Similarly, we can write the conditional density of u t for each candidate Zt as, 

p(u t \z t = k,z\ t ,u\ t ,y 1:T ,X) oc J p(8 k | A) p(u T \0 k )d6 k J W_p{x T \x T ^i,u T )p(y T \x T )dx 1 ., T . (Ill) 

r\z T =k ^ t 

A key step in deriving these conditional distributions is the marginalization of the state sequence x\-t- In performing 
this marginalization, one thing we harness is the fact that conditioning on the control input sequence simplifies 
the SLDS to an LDS with a deterministic control input U\ : t- Thus, conditioning on iti:t-i,t+L-T allows us to 
marginalize the state sequence in the following manner. We run a forward Kalman filter to pass a message from 
t — 2 to t — 1, which is updated by the local likelihood at t — 1. A backward filter is also run to pass a message 
from t + 1 to t, which is updated by the local likelihood at t. These updated messages are combined with the local 
dynamic p(x t | x t -\,u t , 6) and then marginalized over x t and x t -\, resulting in the likelihood of the observation 
sequence y VT as a function of u t , the variable of interest. Because the sampler conditions on control inputs, the 
filter for this time-invariant system can be efficiently implemented by pre-computing the error covariances and then 
solely computing local Kalman updates at every time step. Of note is that the computational complexity is linear 
in the training sequence length, as well as the number of currently instantiated maneuver modes. In the following 
sections, we evaluate each of the integrals of Eq. (1 1 1 1 b and Eq. (1 1 10b in turn. 

A. Chinese Restaurant Franchise 

The integration over 7r appealing in the first line of Eq. (II 101 ) results in exactly the same predictive distribution 
as the sticky HDP-HMM lfT4l 

B. Normal-Inverse-Wishart Posterior Update 

The marginalization of 6 k , appearing both in Eq. (II 101) and Eq. (Ill lb . can be rewritten as follows: 

p(o k \x) p(u T \e k )de k = J p(u t \e k ) P (e k \x) J] P (u T \e k )de k 

r\z T =k r\z T =k,T^t 

oc j p(u t \6 k )p(6 k \{u T \z T = k,T / t},X)d6 k 

= p(u t \{u T \z T = k,T / t},\). (112) 

Here, the set {u T \z T = k,r ^ t} denotes all the observations u T other than u t that were drawn from the Gaussian 
parameterized by 6 k . When 6 k has a normal-inverse-Wishart prior jV1W(k, u, A), standard conjugacy results 
imply that the posterior is: 

p(u t \{u T \z T = k,r + t}, k, ti, v,A)~J\f (u t ; d, ( _ K A) ±M(ut;fi k ,£ k ), (113) 

\ K\y — a — 1) J 



where 

K = K + \{u s \z s = k,S / t}\ 

v = v + |{ii s |z s = k, s / t}\ 
R'd = n'd + u s 



U B e{U s \z B =k,s^t} 
U s £{U s \z s =k,s^t} 



(114) 
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Here, we are using the moment-matched Gaussian approximation to the Student-t predictive distribution for u t 
induced by marginalizing 9}.. 



C. Marginalization by Message Passing 

When considering the control input u t and conditioning on the values of all u T , r ^ t, the marginalization over 
all states x\-t can be equated to a message passing scheme that relies on the conditionally linear dynamical system 
induced by fixing u T , r j^t. Specifically, 



/ Y\p( X r\Xr-l,U T )p(y T \x T )dx 
J X _ 



(115) 



x / / ^-i,t-2(^-l)p(yt-il a t-i)p( s *l x t-i J t^)p(ytl»t)^,t+i(a5t)da;tda; < _i 

\X t -x Jx t 
<*P(.yi:T\ u f,U\t), 

where we recall the definitions of the forward messages m t -i : t{xt) and backward messages rnt+i,t(xt) from 
Appendix iBl For our MTT model of Eq. (l42l ). however, instead of accounting for a process noise mean at 
time t in the filtering equations, we must account for the control input u T . Conditioning on u T , one can equate Bu T 
with a process noise mean, and thus we simply replace ^ Zt ^ with Bu T in the filtering equations of Appendix IBl 
Similarly, we replace the process noise covariance term £^ T ) with our process noise covariance Q. (Note that 
although u T (z T ) ~ N(iiS Zt \ TS z ^), we condition on the value u T so that the MTT parameters {/^( Zt ),£^ t )} do 
not factor into the message passing equations.) 



D. Combining Messages 

To compute the likelihood of Eq. (11 151 ), we take the filtered estimates of Xt-i and x t , combine them with the 
local dynamics and local likelihood, and marginalize over Xt-i and x t . To aid in this computation, we consider 
the exponentiated quadratic form of each term in the integrand of Eq. (11 15b - We then join these terms and use 
standard Gaussian integration formulas to arrive at the desired likelihood. The derivation of this likelihood greatly 
parallels that for the sequential mode sequence sampler of Appendix O 

Recall the forward filter recursions of Appendix |B] in terms of information parameters 

{(9 t _i, t ,A t _i, t ,(9 t | t ,A t |J J 
and the backward filter recursions in terms of 



{0 



t+i,t> At+i t 



1 t\ti lV t\t 



}• 



Replace with But and H* 2 *) with Q where appropriate. We many then write m^t+i (xf) updated with the 
likelihood p{y t _ 1 \x t -\) in exponentiated quadratic form as: 



m t -i,t-2{xt-i)p(yt-i\x t -i, 



oc exp 





X t -1 


T r 




x t 





C T R 



+ 



-i 



C + A t „ 




l,t-2 









x t 



Xt-1 


T r 


x t 





Tn-1 



C T R 



Vt i + n-x,t-2 
o 



}■ 



The local dynamics can similarly be written as 

B T Q- l B B T Q- X A 



p{x t \x t -i,ut) oc exp 





T r 






X t -1 









A T Q^B A T Q- X A 
-Q- l B -Q- l A 



-B T Q~ 
-A T Q~ 

Q- 1 









u t 


T 


' " 




X t -1 


+ 


X t -1 
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Finally, the backward message m tt t+i{xt) updated with the likelihood p{y t \x t ) can be written as 

p(y t \x t )m t ,t+i(xt) oc exp 



H 


X t -1 


T r 




x t 










X t -1 




C T R' l C + k t ,t+i _ 




x t 








i T r 


+ 


X t -1 






x t 





C T R-^y t + 6 t , t+1 



Using the definitions 



Aj^C^C + At+i,* 



H\t 

of; 



A f t{t = C T R- X C + A t _i, t 
^ = C T R- l y t + t _ 1>t) 

we may express the entire integrand as 

mt-i,t-2(xt-i)p(y t -i\xt-i)p(xt\xt-i,ut)p{y t \xt)mt,t+i(xt) oc 

B T Q- l B B T Q- 1 A -B T Q- 1 

i f 

«— i|t— l 



exp< - 









1 








£Ct-i 




2 











A T Q- l B A T Q- l A + \f, -A T Q- 1 



-Q~ l B 



-Q^A Q- 1 + A* |t 



- x,. : Ci|t-i 
v t\t 



Xt-1 

x t 



X t -1 

x t 



Integrating over xt, we obtain an expression proportional to 



AT 1 



Xt-1 



Xt-1 



,A 



u t 

Xt-1 



with 



A 



u t 

Xt-1 



B T Q' 1 B B T Q- 1 A 
A T Q~ l B A T Q- 1 A + K } , 



t-i|t-i 





r b t q^ l 




A T Q- 1 



+ A^) _1 [ Q~ X B Q-iA 



u t 

Xt-1 



B T ^B B T ^A 
A T ^B A T ^A 





't-i\t-i 



+ 



B T Q~ 1 
A T Q~ l 



{Q- 1 + AL)-W t 



tut-i + AT Q~ lK r l0 t\t 



H\t) u t\t 

"t— l|t— 1 J L -* J 

Here, we have defined 

St = + + A^)- 1 ^- 1 = Q- 1 + Q~ X K~ X Q~ X . 

Finally, integrating over x t -i yields an expression proportional to 

N-\uJ-e(u t ),k{u t )), 

with 

A(ttt) = B T ^B - B T ^ 1 A{A T T,^ l A + A* ^"^E^B 

8(u t ) = B T Q- 1 Ki% - B T ^ l A{A T ^A + Af_ 1|t _ 1 )- 1 (^_ 1|t _ 1 + A T Q- 1 K^e\ t ). 
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E. Joining Distributions that Depend on ut 

We have derived two terms which depend on u t : a prior and a likelihood. Normally, one would consider p(u t \6k) 
the prior on u t . However, through marginalization of this parameter, we induced dependencies between the control 
inputs u T and all the u T that were drawn from a distribution parameterized by 9 k inform us of the distribution 
over u t . Therefore, we treat p(u t \{u T \z T = k,r ^ t}) as a prior distribution on u t . The likelihood function 
p{y l . T \u t ] u\ t ) describes the likelihood of an observation sequence y VT given the input sequence uu, containing 
the random variable is u t . 

We multiply the prior distribution by the likelihood function to get the following quadratic expression: 

p(u t \{u T \z T = k,r / t})p(y 1:T \ut;u\ t ) 



DC 



(27T)^/ 2 |S fc 



2 If; 11/2 6XP { " \^ Ut ~ ^kf% X {ut ~ Ajfc) 



1 



(27r)^ v / 2 |E fc | 1 /2 



exp 



(u t - A(u t )- L 9(u t )Y A(u t )(u t - A t y- L 9(u t )) 
z 

uJ{t~ l + K(u t ))u t -2uJ(t- 1 (i k 



+ 9{u t )) + ffit^fa + 9{u t ) T K{u t )- l 9{u t ) 



(2 7 r)^ 2 |(s- 1 + AM)- 1 l 1/2 



exp 



fi T k tl 1 ii k + 9(u t ) T A(u t )- 1 9{u t ) 



(27r)^/2|E fc |V2 

- (E-^fe + 9(u t )) T (£- 1 + Aiut))- 1 ^ 1 ^ + 9{u t )) 

■ M(u t ; (E- 1 + AK))~ 1 (S^ 1 A fe + 9(u t )), (t, 1 + A(u t )) -1 ) 
4 C k ■ M(u t ; (E- 1 + A(u t )y l (± k l fi k + 0(ut)), (S^ 1 + A(u t ))- 1 ), 

where we note that the defined constant C k is a function of z t = k, but not of u t . 



(116) 



F. Resulting (ut,zt) Sampling Distributions 

We write Eq. (II 101) and Eq. (Ill lb in terms of the derived distributions: 

p(z t = k\z\ t ,u\ t ,y 1:T , ft, a, k, A) oc p(z t = k \ z\ t ,P, a, k) 

/ p(w t |{u r |,e T = fc,r ^ t})p(y 1:T |tt t ;«\ t )dt* t , (117) 
= k,z\ t ,u\ t ,y 1:T ,X) ex p(u t \{u T \z T = k,r / i})p(y 1:T |tt t ; tty). (118) 
Thus, the distribution over zt, marginalizing u t , is given by 
p(z t = k\z\ t , u\ t , y 1:T , P, a, k, A) 

oc p(z t = k | z\ t ,P, a, k) I C k - M(u t ; (E" 1 + A(ut))-\t^ l £* + 0(O), (S^ 1 + A(ui)) _1 )dut 
ex C fe • p(z t = fc | Z\ t ,P,a,K). (119) 
and the distribution over ii t (for z t = k fixed) is 



= 2\t, «\t, Vi:T> A ) = A/"(«t; (E fc + A(ti«)) A (E fc V* + 0(«t)) s (E fc + A(«t)) 



(120) 



